Unclassified

AD-A199 940

REPORT DOCUMENTATION PAGE

1b. RESTRICTIVE MARKINGS
2b. DECLASSIFICATION/DOWNGRADING SCHEDULE: N/A
3. DISTRIBUTION/AVAILABILITY OF REPORT: Approved for public release; distribution unlimited
4. PERFORMING ORGANIZATION REPORT NUMBER(S)
5. MONITORING ORGANIZATION REPORT NUMBER(S): AFOSR-TR-88-0975
6a. NAME OF PERFORMING ORGANIZATION: Massachusetts Institute of Technology
6b. OFFICE SYMBOL (If applicable)
6c. ADDRESS (City, State, and ZIP Code): 77 Massachusetts Ave. - Room 33-109, Cambridge, MA 02139
7a. NAME OF MONITORING ORGANIZATION: AFOSR/NM
7b. ADDRESS (City, State, and ZIP Code): Building 410, Bolling AFB, DC 20332-6448
8a. NAME OF FUNDING/SPONSORING ORGANIZATION: AFOSR
8b. OFFICE SYMBOL (If applicable): NM
8c. ADDRESS (City, State, and ZIP Code): Building 410, Bolling AFB, DC 20332-6448
9. PROCUREMENT INSTRUMENT IDENTIFICATION NUMBER: AFOSR-84-0160
10. SOURCE OF FUNDING NUMBERS: PROGRAM ELEMENT NO. 61102F; PROJECT NO. 2304
11. TITLE (Include Security Classification): Approximate Evaluation of Reliability and Availability Via Perturbation Analysis
12. PERSONAL AUTHOR(S): Bruce K. Walker, Siu-Kwong Chu and Norman M. Wereley
13a. TYPE OF REPORT: Final
13b. TIME COVERED: From 6/84 to 9/87
14. DATE OF REPORT (Year, Month, Day): March, 1988
15. PAGE COUNT
17. COSATI CODES (FIELD, GROUP, SUBGROUP)
18. SUBJECT TERMS (Continue on reverse if necessary and identify by block number): Markov models, semi-Markov models, reliability, availability

19  ABSTRACT  (Continue  on  reverse  if  necessary  and  identify  by  block  number) 

The  progress  on  a  three-year  effort  to  examine  approximate  reliability  evaluation  techniques 
for  fault  tolerant  control  and  sensor  systems  is  described.  The  motivation  for  the  work  is 
provided  by  the  fact  that  the  reliability  models  for  these  systems  tend  to  be  finite  state 
semi-Markov  models  with  large  dimensions  that  evolve  relatively  slowly  in  time  due  to  the 
rare  occurrence  rate  of  component  failures.  The  transient  behavior  of  these  models  is  of 
interest  because  the  steady  state  behavior  is  trivial  and  not  of  practical  importance.  The 
evaluation of the transient behavior of such models, however, is intractable even for relatively
simple system architectures because of the widely varying rates at which events occur
in the model. The research effort concentrates on generating useful limit theorems that
approximate the behavior of these models asymptotically well as the small component failure
rates become vanishingly small. Using the work of Korolyuk as a starting point, such limit
theorems are generated for both continuous and discrete time models that are representative
of fault tolerant system behavior. In particular, the limit theorems of Korolyuk are expanded
to cover models where the classes of the decomposed models include trapping states when the


20. DISTRIBUTION/AVAILABILITY OF ABSTRACT: [X] UNCLASSIFIED/UNLIMITED  [ ] SAME AS RPT.  [ ] DTIC USERS
21. ABSTRACT SECURITY CLASSIFICATION: Unclassified
22a. NAME OF RESPONSIBLE INDIVIDUAL  22b. TELEPHONE (Include Area Code)  22c. OFFICE SYMBOL

DD FORM 1473, 84 MAR
83 APR edition may be used until exhausted.
All other editions are obsolete.

SECURITY CLASSIFICATION OF THIS PAGE


BLOCK 19. CONTINUED

AFOSR-TR-88-0975

small parameter vanishes and to cover models where the holding times are not necessarily scaled
by the small parameter. Furthermore, the sufficient conditions for these theorems are expressed
in terms of properties of the unperturbed version of the model that are relatively easy to check
and do not involve generating steady state limits of transition operators.

The application of these limit theorems to some selected fault tolerant system models for
which analytical results can be generated symbolically is also described. This motivates
further work in an effort to expand the applicability of the limit theorem results to the
broadest possible class of models that can result from the analysis of fault tolerant systems.
Preliminary results from this effort are described.




AFOSR-TR-88-0975


Approximate  Evaluation  of  Reliability 
and  Availability  Via  Perturbation  Analysis 


Final  Technical  Report  on 


Grant  AFOSR-84-0160 




Prof.  Bruce  K.  Walker 


Dept. of Aerospace Engineering & Engineering Mechanics
University  of  Cincinnati 
Cincinnati,  OH  45221-0070 


Siu-Kwong  Chu 
Norman  M.  Wereley 

Department  of  Aeronautics  &  Astronautics 
Massachusetts  Institute  of  Technology 
77  Massachusetts  Avenue 
Cambridge,  MA  02139 


March,  1988 


Covering  the  Period: 


June 1, 1984 - September 30, 1987


Prepared  for: 

Maj. Brian W. Woodruff
AFOSR/NM
Building 410
Bolling AFB, DC 20332


TABLE  OF  CONTENTS 


ABSTRACT

INTRODUCTION

PROGRESS SUMMARY

SUMMARY OF SIGNIFICANT FINDINGS AND FUTURE WORK

PERSONNEL

PAPERS AND PRESENTATIONS

REFERENCES


ABSTRACT 


The progress on a three-year effort to examine approximate reliability
evaluation techniques for fault tolerant control and sensor systems is
described. The motivation for the work is provided by the fact that the
reliability models for these systems tend to be finite state semi-Markov
models with large dimension that evolve relatively slowly in time due to the
rare occurrence rate of component failures. The transient behavior of these
models is of interest because the steady state behavior is trivial and not
of practical importance. The evaluation of the transient behavior of such
models, however, is intractable even for relatively simple system
architectures because of the widely varying rates at which events occur in
the model.

The  research  effort  concentrates  on  generating  useful  limit  theorems 
that  approximate  the  behavior  of  these  models  asymptotically  well  as  the 
small  component  failure  rates  become  vanishingly  small.  Using  the  work  of 
Korolyuk  as  a  starting  point,  such  limit  theorems  are  generated  for  both 
continuous  and  discrete  time  models  that  are  representative  of  fault 
tolerant  system  behavior.  In  particular,  the  limit  theorems  of  Korolyuk  are 
expanded  to  cover  models  where  the  classes  of  the  decomposed  models  include 
trapping  states  when  the  small  parameter  vanishes  and  to  cover  models  where 
the  holding  times  are  not  necessarily  scaled  by  the  small  parameter. 
Furthermore,  the  sufficient  conditions  for  these  theorems  are  expressed  in 
terms  of  properties  of  the  unperturbed  version  of  the  model  that  are 
relatively  easy  to  check  and  do  not  involve  generating  steady  state  limits 
of  transition  operators. 

The  application  of  these  limit  theorems  to  some  selected  fault  tolerant 
system models for which analytical results can be generated symbolically is
also described. This motivates further work in an effort to expand the
applicability  of  the  limit  theorem  results  to  the  broadest  possible  class  of 
models  that  can  result  from  the  analysis  of  fault  tolerant  systems. 
Preliminary  results  from  this  effort  are  described. 


I. INTRODUCTION


1.1 Motivation and Discussion of Problem

Reliability  and  availability  have  become  two  of  the  prime 
considerations  in  the  design  of  control  systems  for  a  diverse  group  of 
applications  that  includes  flight  control  systems  for  both  aircraft  and 
spacecraft.  Considerable  effort  is  now  being  devoted  to  the  design  of 
highly  reliable  control  system  components  and  to  the  design  of  fault- 
tolerant  processors  for  online  control  computations.  Despite  the  success  of 
some  of  these  efforts,  the  extremely  high  reliability  goals  that  are 
becoming  commonplace  in  the  Air  Force  and  elsewhere  can  often  be  met  only  by 
designing  control  systems  with  built-in  component  redundancy.  The 
combination  of  a  redundant  system  architecture  and  a  redundancy  management 
(RM) algorithm constitutes a fault-tolerant system design.

Predicting  the  performance  of  these  designs  is  an  important  and 
difficult  problem.  The  performance  is  judged  by  such  quantities  as  the 
reliability,  the  availability,  or  some  other  probabilistic  quantity  such  as 
the  average  measurement  accuracy  or  average  regulation  error.  Calculating 
these  quantities  is  an  important  problem  because  they  represent  the  criteria 
by  which  the  system  design  is  judged.  Such  calculations  are  difficult 
because fault-tolerant systems are subject to random events, such as
failures  and  RM  decisions,  that  change  the  nature  of  operation  of  the  system 
and  therefore  affect  the  values  of  the  performance  quantities. 

Several papers and theses have introduced the concept of modelling the
random behavior of a fault-tolerant system by generalized finite-state
Markov models [1-3, 17-18]. The states in these models characterize the
status of the system in terms of the number of components that are
operating, the number of these that are failed, and the status of the RM
decisions. The transition behavior among these states must then be derived
from the probabilistic behavior of component failures and of the RM
decisions (including errors such as false alarms and missed alarms). Once
this characterization is complete, the resulting Markov model (or, more
generally, semi-Markov model) can be used to derive the statistics of any
relevant quantity that is dependent upon the status of the system. Among
these are the reliability and availability of the system, but the statistics
of other quantities such as the time to first passage of a particular system
status or a performance measure dependent on the system state history can
also be calculated.
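
As a concrete illustration of how a reliability figure can be read off such a model, the following minimal sketch propagates the state probabilities of a small discrete time Markov chain and reports the probability that the system has not yet failed. The three-state duplex system, the per-step failure probability f, and the function names are assumptions introduced only for this example; the models treated in this report also include RM decision states, which are omitted here.

```python
def step(dist, P):
    """Advance a state probability row vector one time step (dist * P)."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def reliability(P, start, failed_state, steps):
    """Probability that the trapping failed_state has not been entered after `steps` steps."""
    dist = list(start)
    for _ in range(steps):
        dist = step(dist, P)
    return 1.0 - dist[failed_state]


# Hypothetical duplex system (not taken from the report):
# state 0 = both components working, state 1 = one working,
# state 2 = system failed (trapping state).
f = 0.1  # assumed per-step failure probability of a single component
P = [
    [(1 - f) ** 2, 2 * f * (1 - f), f ** 2],
    [0.0, 1 - f, f],
    [0.0, 0.0, 1.0],
]
start = [1.0, 0.0, 0.0]

for n in (0, 1, 10, 50):
    print(n, reliability(P, start, failed_state=2, steps=n))
```

Because the only steady state of this chain is the failed state, the quantity of interest is the transient value at a finite number of steps.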

Despite their obvious utility for fault-tolerant system performance
analysis, these models suffer from one serious drawback that has
considerably limited their use. That drawback is that they tend to be
computationally intractable even for relatively simple fault-tolerant system
architectures. This intractability is the result of a number of factors:

1. The number of states in the reliability model can be large,
particularly for complex systems comprising many components.
Essentially, there are as many states in the model as there are
distinct combinations of failed and unfailed components and RM decision
statuses for which the system remains operative. Even the exploitation
of symmetry and similarities in component behavior to reduce the model
order can still leave a very large number of states in the final model.

2. The transient behavior of the model, not the steady state behavior, is
of primary interest. Because the components are subject to failure,
the steady state condition for all fault-tolerant systems lacking
online repair capability is complete system failure. Even when the
recovery of failed or fail-indicated components is possible, a steady
state condition for the reliability model may not become established
until the elapsed time is greater than the useful lifetime of the
system (see comment 4 below). In either case, the transient behavior
of the model is of interest and steady state analysis techniques do not
apply. This is particularly unfortunate when the model is semi-Markov
in nature because the transient analysis of such processes requires the
evaluation of convolution quantities (integrals or sums, respectively,
for continuous or discrete time models) that require massive amounts of
computer memory and computation time.

3.  The  time  horizons  of  interest  are  often  very  long  in  absolute  terms, 
though  they  still  remain  short  relative  to  the  time  required  for  the 
reliability  model  process  to  reach  steady  state.  Typically,  a  fault- 
tolerant  system  will  be  used  for  operating  intervals  that  are  a 
significant  fraction  of  the  expected  lifetime  of  its  most  failure-prone 
components.  This  fraction  seldom  approaches  unity  because  the 
redundancy  level  of  the  failure-prone  components  required  to  satisfy 
reasonable  specifications  on  the  system  reliability  would  drive  the 
price  of  the  system  high  enough  to  justify  the  use  of  fewer,  more 
reliable  (and  therefore  more  expensive)  components.  On  the  other  hand, 
extremely  short  operating  times  would  yield  a  probability  of  failure 
for  any  component  that  is  so  low  that  the  extra  investment  in  fault- 
tolerance  would  not  be  justified  by  the  small  increase  in  reliability. 
In  light  of  Item  2  above  then,  the  transient  behavior  of  a  Markovian 
process  must  be  examined  over  time  horizons  on  the  order  of  the  mean 
time to failure of the most failure-prone component. Given the current
emphasis  on  the  manufacture  of  highly  reliable  components,  these  time 
horizons can be extremely long.

4. A time scale separation tends to exist between the component failure
process and the RM decision process. Failures tend to occur only
rarely and therefore tend to have large time durations between them.
RM decisions, however, must occur quickly following a failure and tend
to occur very rapidly relative to failure events. This means that the
Markovian model of the behavior of the system status exhibits "fast"
modes and "slow" modes. This time scale separation provides the
motivation for the behavioral decomposition methods that have been
investigated by us and by other researchers in the field.
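
The practical consequence of factors 3 and 4 can be seen with some rough arithmetic. The sketch below assumes hypothetical failure and RM decision rates (the numerical values are not from this report) and counts how many time steps a fixed-step computation would need if each step must be short enough to resolve the "fast" mode while the horizon spans one mean time to failure of the "slow" mode.

```python
# Hypothetical event rates (per hour); the values are assumptions for illustration.
FAILURE_RATE = 1.0e-4   # "slow" mode: component failures
DECISION_RATE = 1.0e2   # "fast" mode: RM decisions following a failure


def time_scale_ratio(slow_rate, fast_rate):
    """Ratio of the fast event rate to the slow event rate."""
    return fast_rate / slow_rate


def fixed_steps_to_cover_horizon(slow_rate, fast_rate):
    """Number of fixed time steps of length 1/fast_rate (resolving the fast
    mode) needed to span one mean time to failure, 1/slow_rate."""
    step = 1.0 / fast_rate
    horizon = 1.0 / slow_rate
    return round(horizon / step)


print(time_scale_ratio(FAILURE_RATE, DECISION_RATE))
print(fixed_steps_to_cover_horizon(FAILURE_RATE, DECISION_RATE))
```

With these assumed rates the step count is on the order of a million for a single pass over the horizon, before the number of states or any convolution sums are taken into account.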

The  goal  of  this  research  project  is  to  develop  a  method  that  generates 
approximate  solutions  to  the  generalized  Markov  process  models  that 
characterize fault-tolerant system behavior without the use of excessive
computer memory or computation time. The behavioral decomposition alluded
to in Comment 4 above provides the basis for the approach. However, the
nature of fault-tolerant system models is such that extensions to existing
theory  have  been  necessary  in  order  to  exploit  the  decomposition  approach. 
These  extensions  and  the  numerical  verification  of  their  validity  are  the 
primary  results  obtained  from  the  work  reported  here. 


1.2 Previous and Related Work

A  number  of  researchers  have  addressed  various  aspects  of  the  problem 
of  approximating  the  behavior  of  finite  state  Markov  processes  with  weak 
interactions  between  groups  of  states.  In  the  particular  context  of  fault- 
tolerant  systems,  a  number  of  papers  by  Trivedi  and  coworkers  have  examined 
the  use  of  Markov  models  for  evaluating  the  reliability  of  fault-tolerant 
data  processing  systems  [e.g.,  1-3].  Techniques  have  been  developed  for 
decomposing these models along the lines of the behavioral decomposition
strategy discussed above. However, data processing systems have the unique
property  that  all  of  the  signals  associated  with  the  system  are  binary  and 
therefore  are  rarely  affected  by  noise.  As  a  result,  the  fault  detection 
system  rarely  indicates  falsely  that  a  fault  is  present  when  there  is  no 
fault.  The  model  decomposition  procedures  examined  in  [1-3]  implicitly  rely 
on  this  fact  because  the  model  is  always  assumed  to  take  the  form  of  a  low- 
order,  slow  process  induced  by  the  component  failure  process  upon  which  is 
superimposed  a  fast  process  representing  the  "fault  handling"  by  the  RM 
logic  following  a  fault.  This  structure  is  valid  only  if  "fault  handling" 
occurs  only  following  failures.  This  rules  out  the  possibility  of  false 
failure  indications  in  the  absence  of  failures. 

Although  many  of  the  techniques  developed  in  [1-3]  are  very  powerful 
and  easy  to  use,  the  limitation  just  discussed  renders  them  inapplicable  to 
fault-tolerant systems where false failure indications are likely. This
includes essentially all fault-tolerant sensing and control systems because
these  systems  are  affected  by  noise  and  dynamic  error  sources  that  make 
false  failure  indications  a  major  concern. 

In  the  general  area  of  Markov  processes  with  weak  interactions,  one 
notable recent work is the article by Coderch [4]. This paper is derived
from  [5],  which  contains  an  extensive  description  of  previous  work  in  the 
area.  Much  of  the  work  preceding  [4]  applied  only  to  limited  classes  of 
finite  state  Markov  processes  and,  in  particular,  was  not  applicable  to 
semi-Markov  processes  or  to  processes  with  purely  transient  states.  In  [4], 
a  method  is  described  by  which  continuous  time,  finite  state,  weakly  coupled 
Markov  processes  without  transient  states  can  be  decomposed  into  transition 
operators  that  are  valid  for  increasingly  longer  time  scales.  The  result  is 
a sequence of operators that describe the transition behavior of the process
at each time scale such that the multiple time scale solution for the
process  behavior  converges  to  the  actual  process  behavior  asymptotically  as 
the  small  parameter  representing  the  weak  interactions  converges  to  zero. 
Unfortunately,  the  method  does  not  apply  to  semi-Markov  processes  and  it  has 
not  been  extended  to  apply  to  discrete  time  processes.  Furthermore,  the 
method  requires  the  solution  of  very  complex  linear  algebra  problems,  such 
as  the  description  of  nullspaces  of  operators,  in  the  generation  of  the 
operators  that  are  valid  at  each  time  scale. 

Another  recent  effort  extended  the  results  of  [4]  to  finite  state 
Markov  processes  evolving  in  both  discrete  and  continuous  time  that  include 
special  types  of  transient  states  (called  "nonsplitting  transient  states"  in 
[7]). Some preliminary results of this effort are described in [6].
Further  results  are  described  in  [7].  It  should  be  noted  that  the  results 
in  [6],  like  those  in  the  previously  cited  references,  are  applicable  only 
to  Markov  processes.  Some  results  on  semi-Markov  processes  are  included  in 
[7],  but  the  results  are  again  limited  to  processes  that  contain  only 
nonsplitting  transient  states.  This  rules  out  many  of  the  models  that 
represent  fault  tolerant  systems  because  these  models  consist  almost 
entirely  of  transient  states  (all  but  the  trivial  total  system  failure 
trapping  states),  many  of  which  are  not  "nonsplitting." 

It  should  also  be  noted  that  the  methods  of  [6]  and  [7],  like  those  in 
[4,5],  generate  a  description  of  the  behavior  of  the  process  in  sequentially 
longer time scales. It is frequently the case in fault-tolerant system
analysis  that  the  behavior  of  interest  occurs  only  in  the  first  time  scale. 
This  observation,  combined  with  the  difficulty  that  the  methods  of  [6]  and 
[7] have in dealing with transient states and semi-Markov processes,
suggests that an alternative method for dealing with these processes is of
interest.


Much  of  the  work  reported  here  is  an  extension  of  the  work  reported  by 
Korolyuk et al. [8,9]. These results apply to finite state semi-Markov
processes with weak interactions. The continuous time case is treated in
[8] and the discrete time case in [9]. The key result of [8,9], which will
hereafter be referred to as Korolyuk's limit theorem, is the following:


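Theorem: Given a perturbed finite-state semi-Markov process $z^\epsilon(t)$
whose transition operator elements $P^\epsilon_{ij}(t)$ have the following
dependence on $\epsilon$:

$$P^\epsilon_{ij}(t) = \begin{cases} (p_{ij} - \epsilon q_{ij})\, h_{ij}(t/\epsilon) & \text{if } i, j \in E_k \\ \epsilon q_{ij}\, h_{ij}(t/\epsilon) & \text{if } i \in E_k,\ j \notin E_k \end{cases}$$

with $\sum_{j \in E_k} p_{ij} = 1$ and where $p_{ij}$ and $q_{ij}$ are of order 1 and where the
set of classes $\{E_k\}_{k=1}^{m}$ is disjoint and exhaustive. If the Markov
chains defined by the $p_{ij}$'s within a single class $E_k$ represent an
ergodic Markov process with stationary state probability distribution
$\{\pi_i^{(k)}\}$ for each $k$ ($1 \le k \le m$), then:

$$\lim_{\epsilon \to 0} \text{Prob}(\text{sojourn time from class } E_k \text{ to class } E_r < t) = \gamma_{kr} \cdot (1 - e^{-\lambda_k t})$$

where:

$$\gamma_{kr} = \Big[ \sum_{i \in E_k} \pi_i^{(k)} \sum_{j \in E_r} q_{ij} \Big] \cdot \Big[ \sum_{i \in E_k} \pi_i^{(k)} \sum_{j \in E_k} q_{ij} \Big]^{-1}$$

$$\lambda_k = \Big[ \sum_{i \in E_k} \pi_i^{(k)} \sum_{j \in E_k} q_{ij} \Big] \cdot \Big[ \sum_{i \in E_k} \pi_i^{(k)} \sum_{j \in E_k} p_{ij} \tau_{ij} \Big]^{-1}$$

$$\tau_i = \sum_{j \in E_k} p_{ij} \tau_{ij}$$

and where $\tau_{ij}$ is the mean holding time for the holding time density
$h_{ij}(t)$.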


Proof: For the discrete time case, the proof appears in [9]. The
corresponding  proof  for  the  continuous  time  case  appears  in  [8]. 
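
The parameters of the limiting sojourn time distribution are ratios of sums weighted by the stationary distribution of the class being left. The following sketch evaluates them for one class under that reading of the theorem: the exit coefficient is the weighted sum of the within-class q coefficients, the branching fraction to class r is the weighted sum of the q coefficients into class r divided by that exit coefficient, and the rate is the exit coefficient divided by the weighted mean holding time. The function name, the argument layout and the four-state numerical example are assumptions made for illustration only.

```python
def enlarged_parameters(classes, k, pi_k, p, q, tau):
    """Sketch of the enlarged-process parameters for leaving class k.

    classes : list of lists of state indices (disjoint and exhaustive)
    k       : index of the class being left
    pi_k    : dict mapping each state of class k to its stationary probability
    p, q    : nested lists of the order-1 coefficients p[i][j] and q[i][j]
    tau     : nested list of mean holding times tau[i][j]
    Returns (gamma, lam): gamma[r] is the branching fraction to class r and
    lam is the rate of the limiting exponential sojourn time distribution.
    """
    Ek = classes[k]
    # Stationary-weighted total coefficient for leaving class k.
    exit_total = sum(pi_k[i] * sum(q[i][j] for j in Ek) for i in Ek)
    # Stationary-weighted mean holding time within class k.
    mean_hold = sum(pi_k[i] * sum(p[i][j] * tau[i][j] for j in Ek) for i in Ek)
    lam = exit_total / mean_hold
    gamma = {}
    for r, Er in enumerate(classes):
        if r != k:
            to_r = sum(pi_k[i] * sum(q[i][j] for j in Er) for i in Ek)
            gamma[r] = to_r / exit_total
    return gamma, lam


# Hypothetical four-state example: class 0 = {0, 1}, class 1 = {2}, class 2 = {3}.
# In each row of class 0 the within-class q values sum to the same total as the
# out-of-class q values, so that the perturbed transition probabilities sum to one.
classes = [[0, 1], [2], [3]]
p = [[0.5, 0.5, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
q = [[1.0, 0.0, 1.0, 0.0], [0.0, 3.0, 1.0, 2.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
tau = [[2.0] * 4 for _ in range(4)]
pi_0 = {0: 0.5, 1: 0.5}  # stationary distribution of the p-chain on class 0

gamma, lam = enlarged_parameters(classes, 0, pi_0, p, q, tau)
print(gamma, lam)
```

In this example the branching fractions sum to one, which follows from the row-sum constraint noted in the comments.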

The conditions and consequences of Korolyuk's limit theorem can be
summarized in words as follows. The interactions between the states of the
original semi-Markov process are weak in the sense that the transition
behavior depends upon a small parameter ε such that when ε is zero the
process decomposes into noncommunicating classes of states. The original
process will be referred to hereafter as the perturbed process while the
process that is derived from it by setting ε to zero will be called the
unperturbed process. The form of the transition behavior assumed for the
perturbed process is such that the transition probabilities of its imbedded
Markov process within a class are independent of ε while the interclass
imbedded Markov process transition probabilities are all at least first
order in ε. Also, it is assumed that the holding time densities associated
with all transitions for the perturbed process become compressed near the
origin as ε becomes small. Finally, it is assumed that the decomposed
classes of the unperturbed process are all ergodic. When all of these
conditions are satisfied, the interclass behavior of the perturbed process
over time horizons on the order of t/ε can be approximated by a reduced
order Markov process in this "slow" time scale. This process is called the
enlarged  Markov  process.  The  approximate  behavior  of  the  original  perturbed 
process  can  then  be  derived  by  expanding  the  enlarged  process  probabilities 
of  occupying  each  class  with  the  stationary  distribution  of  probability 
within  each  class  of  the  unperturbed  process  that  results  from  the 
ergodicity  of  these  classes.  The  parameters  of  the  enlarged  Markov  process 
are  expressed  in  terms  of  the  decomposed  transition  probabilities  of  the 
perturbed  process  and  the  mean  holding  times  associated  with  the  holding 
time  distributions. 
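
The expansion step described above can be sketched as follows: the probability of occupying an individual state is approximated by the enlarged-process probability of occupying its class multiplied by the stationary probability of that state within the class. The two-class enlarged process used here (class 0 left at a constant rate, class 1 trapping), the rate, and the stationary distributions are all hypothetical values chosen for illustration.

```python
import math


def class_probabilities(lam, t):
    """Class occupancy probabilities of a hypothetical two-class enlarged
    process at slow-time t: class 0 is left at rate lam, class 1 is trapping."""
    stay = math.exp(-lam * t)
    return [stay, 1.0 - stay]


def expand(class_probs, stationary):
    """Approximate state probabilities: class probability times the
    stationary probability of each state within that class."""
    return [[pk * pi_i for pi_i in pi_k] for pk, pi_k in zip(class_probs, stationary)]


LAM = 0.5                            # assumed enlarged-process rate
STATIONARY = [[0.75, 0.25], [1.0]]   # assumed within-class stationary distributions

approx = expand(class_probabilities(LAM, 2.0), STATIONARY)
print(approx)
```

The approximate state probabilities sum to one because each within-class stationary distribution sums to one.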

This  result  is  very  powerful  for  approximating  the  behavior  of  semi- 
Markov  processes  that  satisfy  all  of  the  conditions.  Unfortunately,  most 
models of fault-tolerant system behavior do not satisfy these conditions.
This  provides  the  motivation  for  most  of  the  work  on  this  project. 

1.3 Research Goals

The  research  goals  for  the  project  can  be  summarized  as  follows: 

1.  Extend  the  results  of  [8,9],  if  possible,  by  carefully  reviewing  the 
proofs  included  there  and  recognizing  points  where  the  restrictive 
assumptions  can  be  relaxed. 

2. Extend the results of [8,9] to perturbed semi-Markov models evolving in
continuous time where the holding time densities do not depend directly
upon the small interaction parameter ε but rather on a small time
scaling parameter.

3.  Conduct  investigations  on  several  continuous  time  models,  including 
some  with  nonergodic  classes.  Attempt  to  identify  theoretical  results 
regarding such models.

4. Develop results similar to [8,9] as extended by the two previous goals
for discrete time semi-Markov models of fault tolerant systems.

5. Develop a means for generating the exact solution to both continuous
and discrete time models of simple fault-tolerant systems for the
purpose of comparison with the results generated by the approximate
technique.

The  next  section  of  this  report  will  discuss  the  progress  made  on  these 
goals during the project.


II. PROGRESS SUMMARY

In  this  section,  the  work  of  the  past  three  years  is  summarized  and  is 
related  to  the  goals  discussed  above.  The  work  will  be  summarized  here  in 
approximately  the  order  in  which  it  was  accomplished.  Numerous  references 
are  made  to  [10],  which  is  the  S.M.  thesis  of  Siu-Kwong  Chu  that  was 
completed  under  the  support  of  this  grant.  This  thesis  was  included  as 
Appendix  A  of  [12].  A  draft  version  of  a  paper  [13]  that  was  derived  from 
this  thesis  is  included  as  Appendix  A  of  this  report.  References  are  also 
made  to  [14]  and  [15],  which  are  the  S.M.  thesis  of  Norman  Wereley  and  a 
paper  adapted  from  it,  respectively.  The  paper  [15]  is  included  as  Appendix 
B  of  this  report.  The  thesis  is  available  from  the  authors.  It  was  sent  to 
AFOSR  previously  under  separate  cover. 

2.1 Attempts to Extend Korolyuk's Results

Fault-tolerant system models tend to have two characteristics that
violate the conditions imposed on the semi-Markov processes examined in
[8,9]. One is that the holding time densities do not compress as the small
parameter representing the weak interclass interactions is made smaller.
The reason for this is that the holding time densities for fault-tolerant
system models are determined by the probability mass functions of the time
needed for various sequential fault diagnosis tests to reach decisions. The
behavior of the fault diagnosis tests typically occurs in the "fast" time
scale, but it is not altered by changes in the failure rate of the
components, which is usually the source of the small interaction parameter
in these models. This situation is illustrated clearly by the model derived
in Chapter 3 of [10], which is the 9-state model referred to in [11]. None
of the holding time densities for this model display the explicit dependence
on the scaled time t/ε that [8,9] assume (see Appendix C of [10]).

The other manner in which fault-tolerant system models often violate
the conditions assumed in [8,9] is with respect to the ergodicity of the
classes when ε = 0. Many fault-tolerant systems include RM logic that shuts
off a component permanently once it has been diagnosed as failed. If this
diagnosis is the result of a false alarm, the corresponding system status
state involves no failures and hence tends to be in the same class upon
decomposition of the model as other no-failure states such as the state
where no failures and no RM decisions have yet taken place. This means that
these classes include "splitting" transient states, and the results of [7]
are not applicable. Also, the false alarm state is a trapping state for
this class when the failure probability (and hence ε) is set to zero.
Therefore, this class of the unperturbed process is nonergodic. This tends
to be true of many of the classes of states associated with models of fault-
tolerant system behavior when irreversible RM logic is used by the system.
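
The effect of a false alarm trapping state on a no-failure class can be seen in a two-state sketch. State 0 stands for "no failures, no RM decisions" and state 1 for "component shut off after a false alarm"; with the failure probability set to zero, nothing leaves state 1. The false alarm probability and the helper names are hypothetical and chosen only for illustration.

```python
def mat_mul(A, B):
    """Product of two square matrices stored as nested lists."""
    n = len(A)
    return [[sum(A[i][m] * B[m][j] for m in range(n)) for j in range(n)] for i in range(n)]


def mat_power(P, n):
    """n-step transition matrix P**n by repeated multiplication."""
    size = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result


FALSE_ALARM_PROB = 0.05  # assumed per-step false alarm probability

# Imbedded transition probabilities of the unperturbed class (failure probability zero).
P_CLASS = [
    [1.0 - FALSE_ALARM_PROB, FALSE_ALARM_PROB],  # state 0: may raise a false alarm
    [0.0, 1.0],                                  # state 1: trapping state
]

for n in (1, 10, 100, 500):
    print(n, mat_power(P_CLASS, n)[0])
```

All of the probability started in state 0 drains into the trapping state as the number of steps grows, so state 0 is transient within its class.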

A  major  part  of  the  early  research  on  the  project  was  devoted  to 
extensive  study  of  the  results  in  [8,9]  to  determine  whether  they  could  be 
directly extended. In [11], it was noted that the ergodicity of the classes
is  actually  a  stronger  condition  than  what  is  sufficient  for  the  proofs 
presented  in  [8,9]  to  hold.  In  particular,  the  following  Proposition  was 
put  forward: 

Proposition: The results of Korolyuk's limit theorem hold if the
transition operator $P_k$ of each class $k$ of the unperturbed process is
such that either of the following is true:

(1) $P_k$ is the transition operator for an ergodic Markov process
with unique stationary state probability distribution $\{\pi_i^{(k)}\}$,

or

(2) The inverse operator $[I - P_k + \Pi_k]^{-1}$ exists.

In these statements, $P_k$ and $\Pi_k$ are defined as follows:

$P_k$ = transition probability operator for the imbedded Markov
process governing transitions within class $k$ of the unperturbed
process,

and

$$\Pi_k = \lim_{n \to \infty} P_k^n \quad \text{if this limit exists}$$

$$\Pi_k = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} P_k^i \quad \text{otherwise, if this exists.}$$


Proof: The proof of this Proposition is imbedded in the proof
presented in [9] of Korolyuk's limit theorem. In [9], Korolyuk assumes the
classes of the unperturbed process are ergodic, which guarantees the
existence of the inverse operator. However, it is only the existence of the
inverse operator that is required in the proof.

The  conclusion  that  can  be  drawn  from  this  Proposition  is  that  the 
results  of  [8,9]  are  valid  when  the  weaker  sufficient  condition  represented 
by  (2)  in  the  Proposition  is  satisfied  by  each  class  in  the  model.  This  is 
a considerable generalization of Korolyuk's results. The unperturbed
processes derived from many fault-tolerant system models satisfy this weakened
sufficient condition. In fact, a result will be stated and proven in a
later section that generalizes Korolyuk's limit theorem to all perturbed
semi-Markov processes that have a corresponding unperturbed process with a
particular  property.  This  property  is  possessed  by  all  perturbed  semi- 
Markov  models,  and  therefore  all  semi-Markov  fault  tolerant  system  models, 
that  have  unperturbed  model  classes  that  are  aperiodic  and  contain  no  more 
than  one  trapping  state.  Such  models  also  satisfy  the  weakened  sufficient 
condition  of  this  section. 

Despite  the  considerable  generality  that  the  weakened  sufficient 
condition  implies,  several  problems  still  exist  in  applying  the  results  to 
models  of  fault  tolerant  system  behavior.  The  first  problem  is  the 
nondependence of the holding time densities on the small parameter ε. This
will  be  discussed  in  the  next  section. 

The  other  problem  is  that  models  for  fault  tolerant  system  behavior  are 
rarely  specified  in  the  standard  form  for  semi-Markov  models.  The  standard 
form  for  each  of  the  transition  operator  elements  for  a  semi-Markov  process 
is  the  product  of  an  imbedded  Markov  process  transition  probability  and  a 
holding  time  density.  For  most  fault  tolerant  systems,  the  transition 
operator  elements  are  derived  directly  and  therefore  do  not  take  this 
product  form,  although  the  decomposition  with  respect  to  e  is  clear.  The 
imbedded  Markov  process  transition  probabilities  can  be  determined  in  these 
cases  only  by  integrating  (or  summing)  the  operator  element  histories  with 
an  infinite  upper  limit.  The  weakened  sufficient  condition  requires  only 
the  calculation  of  the  "fast"  imbedded  Markov  process  transition 
probabilities $p_{ij}^{(k)}$, so the calculation typically converges rather quickly.
However, the subsequent calculation of $\Pi_k$ for each of the classes can be
difficult, especially if the Cesaro limit form must be used. This latter
calculation, if it is done numerically, also amplifies any error that may
have been present in $P_k$. Finally, to check the weakened sufficient
condition, it is necessary to find the determinant of the operator $I - P_k + \Pi_k$.
This may be extremely difficult numerically.
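
The check described above can be sketched numerically. The following is only an illustrative sketch, not the procedure used in the project: it assumes a small class whose imbedded transition matrix is known exactly, approximates the Cesaro limit by a finite average, and then evaluates the determinant of I - P + Pi. The function names, the example matrix, and the number of terms are all hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def cesaro_limit(p, n_terms=2000):
    """Approximate (1/n) * sum_{i=1..n} P^i by a finite average."""
    n = len(p)
    power = [row[:] for row in p]            # holds P^1 on the first pass
    total = [[0.0] * n for _ in range(n)]
    for _ in range(n_terms):
        for i in range(n):
            for j in range(n):
                total[i][j] += power[i][j]
        power = mat_mul(power, p)            # advance to the next power
    return [[x / n_terms for x in row] for row in total]


def determinant(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    det = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-14:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            det = -det
        det *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return det


def weakened_condition_det(p, n_terms=2000):
    """Return det(I - P + Pi), with Pi the approximate Cesaro limit of P."""
    n = len(p)
    pi = cesaro_limit(p, n_terms)
    m = [[(1.0 if i == j else 0.0) - p[i][j] + pi[i][j] for j in range(n)]
         for i in range(n)]
    return determinant(m)


# Hypothetical two-state class with a single trapping state (state 1).
P_EXAMPLE = [[0.5, 0.5], [0.0, 1.0]]
PI_EXAMPLE = cesaro_limit(P_EXAMPLE)
DET_EXAMPLE = weakened_condition_det(P_EXAMPLE)
```

In this toy case the finite Cesaro average approaches its limit only at a rate of roughly 1/n, which gives some feeling for why the text regards the numerical calculation as delicate.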

Because  some  of  the  calculations  leading  to  a  check  of  the  weakened 
sufficient  condition  are  subject  to  error,  we  chose  to  continue  our  research 
to  develop  other  conditions  that  could  be  checked  under  which  the  results  of 
Korolyuk's  limit  theorem  hold.  These  will  be  discussed  later  in  this 
report. 

2.2  Time-scaling of Continuous Time Models

Our  research  then  turned  toward  circumventing  the  problem  that  the 
holding  time  densities  for  fault  tolerant  system  models  are  not  dependent  on 
the  small  parameter  e  representing  the  weak  interactions,  as  is  required  by 
Korolyuk's limit theorem. The approach, which was first described in [12],
is  to  introduce  a  second  small  parameter  that  represents  time  scaling  into 
the  model.  In  this  section,  we  summarize  the  results  of  that  work. 

When  the  time  axis  over  which  a  semi-Markov  model  of  fault- tolerant 
system behavior evolves is scaled by a small parameter δ, the holding time
densities in the model take the form that is required for the application of
Korolyuk's limit theorem provided the parameter δ is proportional to ε.
This  idea  is  explained  rigorously  in  section  2.2.1  of  [10]  and  in  Section 
2.2  of  [13].  After  introducing  this  time  scaling,  it  is  possible  to 
rederive  the  results  that  are  of  interest  for  asymptotic  approximations  to 
the  behavior  of  these  semi -Markov  models. 

Let  E  be  the  state  space  of  a  finite  state  semi-Markov  process  that 
evolves  in  continuous  time  t.  Suppose  that  the  process  is  observed  with 
respect to the scaled time t/δ. Suppose further that the transition
operator of the process is such that its (i,j) element representing
transitions from state i to state j has the form:

$$P_{ij}^{\varepsilon}(t') = p_{ij}^{\varepsilon} \, F_{ij}(t'/\delta), \qquad i,j \in E$$

where t' represents scaled time and where the imbedded Markov process
probabilities $p_{ij}^{\varepsilon}$ take the form:

$$p_{ij}^{\varepsilon} = \begin{cases} p_{ij}^{(k)} - \varepsilon \, q_{ij}^{(k)}, & i,j \in E_k \\ \varepsilon \, q_{ij}^{(k)}, & i \in E_k, \; j \notin E_k \end{cases}$$

Here it is assumed that the state space E decomposes into weakly interacting
classes $\{E_1, \ldots, E_N\}$. It is also assumed that the $p_{ij}^{(k)}$ for each $E_k$ sum
to unity, hence when ε = 0 the classes $E_k$ become noninteracting and each
describes a valid semi-Markov process.
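
The weak-interaction structure just described can be illustrated with a small numerical sketch. It assumes the decomposition reads p_ij(eps) = p_ij^(k) - eps*q_ij^(k) within a class and eps*q_ij^(k) between classes, and it uses a hypothetical three-state model with two classes; the matrices and the helper name are illustrative only.

```python
def perturbed_matrix(p0, q, class_of, eps):
    """Build the perturbed imbedded matrix from its unperturbed part.

    p0       : unperturbed (block-diagonal) imbedded transition matrix
    q        : first-order coefficients q_ij
    class_of : class_of[i] is the class index of state i
    eps      : small parameter controlling the interclass coupling
    """
    n = len(p0)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if class_of[i] == class_of[j]:
                out[i][j] = p0[i][j] - eps * q[i][j]   # within-class entry
            else:
                out[i][j] = eps * q[i][j]              # weak interclass entry
    return out


# Hypothetical model: states 0 and 1 form class 0, state 2 forms class 1.
CLASS_OF = [0, 0, 1]
P0 = [[0.2, 0.8, 0.0],
      [0.6, 0.4, 0.0],
      [0.0, 0.0, 1.0]]
# Each row's within-class q entries sum to its interclass q entries, so that
# every row of the perturbed matrix still sums to one.
Q = [[1.0, 0.0, 1.0],
     [0.0, 2.0, 2.0],
     [0.0, 0.0, 0.0]]

P_EPS = perturbed_matrix(P0, Q, CLASS_OF, eps=0.01)
P_ZERO = perturbed_matrix(P0, Q, CLASS_OF, eps=0.0)
```

With eps set to zero the matrix reduces to the block-diagonal P0, so the two classes no longer communicate.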

Now let $\tau_{kr}^{(i)}$ be the sojourn time (in scaled time) of the process in
class $E_k$ when it begins from state $i \in E_k$ and transits to class $E_r$ with $r \neq k$.
Let $\phi_{kr}^{(i)}(s)$ denote the characteristic function of $\tau_{kr}^{(i)}$. Then, if the
$p_{ij}^{(k)}$ for each k represent the transition probabilities of an ergodic Markov
chain, the $\phi_{kr}^{(i)}(s)$ are independent of the superscript i and take the form:

[Expression for $\phi_{kr}(s)$ not legible in the scanned source; it is given in [10, sec. 2.2.2] and in [13, sec. 2.2].]

where the $\pi_i^{(k)}$ are the stationary probabilities of the ergodic semi-Markov
process associated with class $E_k$, the $a_{ij}$ are the mean holding times
associated with the $F_{ij}(t)$ in the original time scale, α is δ/ε, and $p_{kr}$ and
$\Lambda_k$ are parameters defined in [10, sec. 2.2.2] and in [13, sec. 2.2]. Note
that this expression takes the form of the characteristic function of a
Markov process transition operator with imbedded transition probability $p_{kr}$
and transition rate time constant $\Lambda_k/\alpha$. Thus, the interclass transitions
are Markovian in scaled time.

This  result  is  derived  in  [10,  sec.  2.2.2]  and  in  sec.  2.2  of  [13]. 

The  result  expressed  above  makes  possible  the  analysis  of  continuous 
time  semi -Markov  models  of  fault  tolerant  system  behavior  provided  the  model 
has  ergodic  classes  (note  the  underlined  condition  above) .  Many  fault 
tolerant  system  models  violate  this  condition,  as  was  discussed  above. 
However,  many  fault- tolerant  systems  that  do  not  employ  irreversible  fault 
isolation  logic  do  produce  models  with  ergodic  classes.  Therefore,  this 
result  is  a  positive  step  toward  analysis  of  models  for  these  types  of 
systems.  Furthermore,  as  the  discussion  of  the  preceding  section  has  shown, 
models  that  satisfy  the  weakened  sufficient  condition  can  also  be 
approximated  by  these  results,  including  all  models  with  unperturbed 
processes  that  are  aperiodic  and  contain  no  more  than  one  trapping  state. 

The  manner  in  which  the  result  above  can  be  used  for  approximate 
analyses  is  as  follows.  Suppose  a  model  for  a  fault  tolerant  system  has 
been  constructed  and  one  is  interested  in  calculating  the  state 
probabilities for the model at some relatively large value of time t in
order  to  assess  the  reliability  (or  some  other  status-related  property)  of 
the  system.  Suppose  further  that  the  model  satisfies  the  conditions  stated 
above.  Then  the  approximate  class  occupancy  probabilities  at  the  desired 
time  can  be  calculated  by  scaling  time  appropriately,  constructing  the 
enlarged  Markov  process  that  approximately  governs  interclass  behavior  from 
the  result  above  and  solving  this  relatively  easy,  reduced  order  Markov 
process  problem.  It  is  assumed  here  that  the  initial  condition  is  known  for 
the  state  probabilities  and  therefore  also  for  the  class  occupancy 
probabilities.  The  results  should  then  be  rescaled  back  to  the  original 
time  scale.  Finally,  the  approximate  state  probabilities  can  be  evaluated 
by  weighting  the  stationary  probability  distribution  associated  with  each 
class when ε = 0 by the appropriate approximate class occupancy probability.
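
The last two steps of this procedure (solving the reduced order Markov process and weighting the class stationary distributions) can be sketched as follows. This is a minimal sketch under stated assumptions: the interclass generator is taken as already known and expressed in the original time scale, so the time scaling bookkeeping is omitted, and the two-class model, its rate, and its stationary distributions are hypothetical numbers chosen only for illustration.

```python
import math


def derivative(c, g):
    """Row-vector form of the forward equations: dc/dt = c G."""
    n = len(c)
    return [sum(c[i] * g[i][j] for i in range(n)) for j in range(n)]


def class_probabilities(c0, g, t, n_steps=1000):
    """Integrate dc/dt = c G from 0 to t with the classical RK4 scheme."""
    h = t / n_steps
    c = list(c0)
    for _ in range(n_steps):
        k1 = derivative(c, g)
        k2 = derivative([x + 0.5 * h * k for x, k in zip(c, k1)], g)
        k3 = derivative([x + 0.5 * h * k for x, k in zip(c, k2)], g)
        k4 = derivative([x + h * k for x, k in zip(c, k3)], g)
        c = [x + h * (a + 2.0 * b + 2.0 * d + e) / 6.0
             for x, a, b, d, e in zip(c, k1, k2, k3, k4)]
    return c


def state_probabilities(class_probs, stationary):
    """Weight each class's stationary distribution by its occupancy."""
    return [[ck * p for p in dist] for ck, dist in zip(class_probs, stationary)]


# Hypothetical reduced-order model: class 0 leaks into absorbing class 1.
RATE = 1.0e-3
G = [[-RATE, RATE],
     [0.0, 0.0]]
STATIONARY = [[0.7, 0.3],   # stationary distribution within class 0
              [1.0]]        # class 1 has a single state
T_EVAL = 500.0

CLASS_PROBS = class_probabilities([1.0, 0.0], G, T_EVAL)
STATE_PROBS = state_probabilities(CLASS_PROBS, STATIONARY)
EXACT_CLASS_0 = math.exp(-RATE * T_EVAL)   # closed form for this toy model
```

For this toy generator the class 0 occupancy has the closed form exp(-RATE * t), which gives a simple check on the integration.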

To  illustrate  the  approximate  evaluation  procedure,  a  model  for  a 
generic  fault  tolerant  system  was  constructed  and  solved  using  both  "brute 
force"  numerical  convolution  techniques  and  the  approximate  technique 
described  above.  The  system  consisted  of  three  components  where  at  least 
one  unfailed  component  must  be  available  for  the  system  to  remain  operating. 
It  was  assumed  that  the  failure  diagnosis  algorithm  used  sequential  tests  in 
combination with logic that is described in detail in sec. 3.1 of [10]. The
tests  were  assumed  to  have  second  order  Erlang  distributions  for  their  times 
to  decision.  The  logic  included  the  possibility  of  recovering  components 
that  have  previously  been  diagnosed  as  failed,  thereby  leading  to  a  model 
that  has  ergodic  unperturbed  process  classes.  The  complete  model  is 
described  in  secs.  3.3  through  3.5  and  Appendix  C  of  [10].  The  model  has  9 
states  which  decompose  into  three  classes  when  the  small  failure  rate  is  set 
to  zero. 

The exact state probability histories are obtained numerically and are
described  in  chapter  4  of  [10].  It  should  be  noted  that  a  very  large  amount 
of  computational  effort  was  required  to  generate  these  exact  solutions.  The 
approximate  model  is  also  constructed  and  solved  in  chapter  4  of  [10].  The 
approximate  solutions  were  obtained  from  a  relatively  short  FORTRAN  program. 
They  could  have  been  generated  using  just  a  hand  calculator.  Only  when 
complete  time  histories  were  desired  was  it  absolutely  necessary  to  resort 
to  the  use  of  a  computer.  Upon  comparison  of  the  results,  one  finds  that 
the  largest  error  in  the  evaluation  of  any  of  the  state  probabilities  by  the 
approximate  method  for  this  example  is  less  than  1%  of  the  value  obtained  by 
numerical  means  (which  itself  is  subject  to  a  small  amount  of  error)  for 
times  greater  than  the  longest  mean  holding  time  of  the  sequential  tests, 
where  the  assumed  mean  time  between  failures  is  3  orders  of  magnitude  longer 
than  this. 

These  results  are  very  encouraging,  but  they  are  not  sufficient  to 
conclude  that  the  approximate  technique  always  works  so  well.  In  order  to 
further  investigate  the  properties  of  the  approximate  technique  with  the 
time  scaling  included,  a  number  of  four -state  semi -Markov  models  were 
examined.  These  models  were  chosen  to  reflect  various  characteristics  that 
larger  fault  tolerant  system  models  tend  to  possess.  By  keeping  the 
dimension  at  4,  however,  it  is  possible  to  generate  the  true  behavior  of  the 
model  with  relative  ease  whereas  models  of  larger  dimension  are  extremely 
difficult  to  solve  exactly  (recall  the  comments  above  regarding  the  nine- 
state  model).  Even  four-state  models  are  difficult  enough  to  solve, 
however,  that  symbolic  manipulation  was  necessary  to  generate  the  exact 
solutions. This is true despite the fact that none of the holding time
densities in the models were assumed to be any more complex than second
order  Erlang. 

The  five  cases  of  four- state  models  that  were  examined  are  discussed  in 
detail  in  chapter  5  of  [10].  Two  of  these  cases  are  included  in  [13].  The 
approximate  method  produced  very  accurate  results  in  every  case  that  was 
examined.  The  comparison  between  the  results  was  almost  always  exact  to  4 
decimal  places  except  in  the  very  early  time  periods  before  the  startup 
transient  of  the  process  has  decayed. 

One  of  the  cases  of  four- state  models  that  was  examined  was  a  model 
that  did  not  have  ergodic  unperturbed  process  classes  (Case  IV  of  [10]  or 
Case  2  of  [13]).  Because  these  examples  are  artificially  constructed,  it  is 
relatively  easy  to  check  the  weakened  sufficient  condition  that  was 
discussed  in  the  previous  section.  A  brief  calculation  shows  that  it  is 
satisfied,  therefore  the  results  of  Korolyuk's  limit  theorem  hold  for  this 
model  as  well  as  the  other  cases  that  were  examined.  This  makes  the 
accuracy  of  the  results  obtained  by  the  approximate  method  not  surprising. 

As  was  discussed  at  the  end  of  the  preceding  section,  however,  we  still 
desire  a  means  for  determining  whether  the  results  of  Korolyuk's  limit 
theorem hold without calculating the operator $\Pi_k$ for each class of the
unperturbed model. The next section discusses our work along these lines.

2.3  Relaxation of Ergodicity Condition

Many  fault  tolerant  systems  yield  generalized  Markovian  models  of  their 
behavior  that  decompose  into  classes  that  satisfy  all  of  the  conditions  for 
applying  the  approximate  technique  except  the  condition  that  the  classes  of 
the  unperturbed  process  must  be  ergodic.  This  is  typically  the  result  of 
irreversible logic structures in the RM algorithm for the system such that
diagnostic  decisions  alone  can  permanently  eliminate  a  component  from  use. 

However,  in  the  analysis  of  four-state  models  discussed  above,  it  was 
noted  that  excellent  results  were  obtained  when  the  approximate  method  was 
applied  to  a  case  where  the  unperturbed  process  generated  by  the  model  did 
not  possess  ergodic  classes.  As  was  noted  above,  this  is  not  surprising 
because  the  inverse  operator  discussed  in  section  2.1  exists  for  each  of  the 
classes  in  this  case.  As  we  also  noted  above,  however,  we  desire  a  simpler 
method  for  checking  whether  the  results  of  Korolyuk's  limit  theorem  are 
valid than computing the determinant of each of the operators $I - P_k + \Pi_k$.

After  careful  examination  of  the  underlying  reasons  that  the  results  of 
Korolyuk's  limit  theorem  hold  for  cases  where  the  unperturbed  process  does 
not  possess  ergodic  classes,  we  developed  the  following  theorem: 

Theorem 1: Let a semi-Markov process depend upon ε such that it can be
decomposed and time scaled in the manner described in section 2.2.
Suppose in addition that the imbedded Markov process transition operator
$P_k$ associated with the k-th class of the unperturbed process satisfies:

$$\lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} P_k^m = [\, e \;\; e \;\; \cdots \;\; e \,]^T$$

where e is some constant vector whose elements are nonnegative and sum
to unity. Suppose this is true for each k (with different e in each
case, in general). Then the interclass transition behavior of the
perturbed process approaches the same enlarged Markov process behavior
that was described in section 2.2 as ε approaches zero.

Proof: The proof of this theorem appears in chapter 6 of [10] and section
3 of [13].

The  sufficient  condition  of  Theorem  1  is  weaker  than  the  ergodicity  of 
the  classes  that  was  required  by  Korolyuk.  Furthermore,  the  sufficient 
conditions  of  Theorem  1  are  more  easily  checked  than  the  weakened  sufficient 
condition  that  was  discussed  in  section  2.1  because  the  determinant  of  the 
operator $I - P_k + \Pi_k$ need not be computed. Note, however, that the Cesaro limit
operator $\Pi_k$ must be computed to check the conditions of Theorem 1. This is
still undesirable from a practical computational standpoint.

The  analysis  leading  to  Theorem  1  led  us  to  consider  the  specific 
situations  in  which  the  conditions  of  the  theorem  are  satisfied.  This 
investigation  led  to  the  following  refinement: 

Theorem 2: Let a semi-Markov process depend on ε such that it can be
decomposed into classes and time scaled as prescribed in section 2.2.
The transition operator $P_k$ of the imbedded Markov process associated
with the k-th class of the unperturbed process will satisfy the
conditions of Theorem 1 if:

1. The k-th class is ergodic, or

2. $P_k$ has one and only one eigenvalue of unity.

Proof: The proof of this theorem also appears in chapter 6 of [10] and
section 3 of [13].

Theorem  2  provides  a  more  restrictive  but  more  easily  checked 
sufficient  condition  than  Theorem  1  because  the  Cesaro  limit  is  no  longer 
necessary.

It should be emphasized that Theorems 1 and 2 still represent only
sufficient  conditions  for  the  approximate  technique  to  yield  an  accurate 
approximation to the behavior of the perturbed process as the
parameter ε approaches zero. In other words, there may exist perturbed
semi-Markov models that do not satisfy the conditions stated in Theorem 1 or
Theorem  2  whose  behavior  can  still  be  approximated  well  by  the  approximate 
method. 

Some  examples  of  model  structures  that  do  and  do  not  satisfy  the 
sufficient  conditions  of  Theorem  1  or  2  are  presented  in  chapter  6  of  [10] 
and  at  the  end  of  section  3  of  [13].  One  example  in  particular  that  does 
not  satisfy  the  conditions  includes  a  class  in  its  unperturbed  process  model 
that  contains  multiple  trapping  states.  We  shall  discuss  this  situation 
later  in  section  2.6. 
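
The contrast between classes with one and with several trapping states can be illustrated numerically. The sketch below assumes the condition of Theorem 1 amounts to the Cesaro limit of the class transition matrix having identical rows, and it approximates that limit by a finite average. Both example matrices, the tolerance, and the function names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def cesaro_average(p, n_terms=5000):
    """Finite approximation to (1/n) * sum_{m=1..n} P^m."""
    n = len(p)
    power = [row[:] for row in p]
    total = [[0.0] * n for _ in range(n)]
    for _ in range(n_terms):
        total = [[t + x for t, x in zip(trow, prow)]
                 for trow, prow in zip(total, power)]
        power = mat_mul(power, p)
    return [[x / n_terms for x in row] for row in total]


def rows_identical(m, tol=1e-2):
    """True if every row matches the first row to within tol."""
    return all(abs(x - y) <= tol for row in m[1:] for x, y in zip(row, m[0]))


# Class with a single trapping state (state 1).
ONE_TRAP = [[0.5, 0.5],
            [0.0, 1.0]]
# Class with two trapping states (states 1 and 2).
TWO_TRAPS = [[0.5, 0.25, 0.25],
             [0.0, 1.0, 0.0],
             [0.0, 0.0, 1.0]]

ONE_TRAP_OK = rows_identical(cesaro_average(ONE_TRAP))
TWO_TRAPS_OK = rows_identical(cesaro_average(TWO_TRAPS))
```

In the two-trap example the limiting rows depend on the starting state, so the rows differ and the check fails, in line with the remark above about multiple trapping states.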

2.4  Discrete Time Models

All  of  the  results  described  so  far  in  this  report  have  applied  to 
continuous  time  models  of  fault  tolerant  system  behavior.  However,  the  RM 
algorithms  for  fault  tolerant  systems  are  usually  implemented  on  a  digital 
computer  with  a  significant  time  interval  between  successive  applications  of 
the  diagnosis  tests.  Therefore,  fault  tolerant  system  models  are  often 
purely  discrete  time  in  nature. 

During  the  course  of  the  project,  parallel  efforts  were  made  to  derive 
results  for  perturbed  discrete  time  semi-Markov  processes  that  mimic  those 
discussed above for continuous time processes. This section and the paper
[15] report on these efforts.

Much  of  the  work  that  has  been  accomplished  during  the  project  for 
discrete  time  models  has  related  to  the  adaptation  of  Korolyuk's  limit 
theorem for semi-Markov processes [8] to semi-Markov chains. In addition, a
limit  theorem  with  time  scaling  for  semi-Markov  chains  was  also  developed. 
The  theorem  statements  will  be  summarized  below. 

An  important  result  that  will  be  referred  to  in  the  statement  of  both 
theorems  is  the  following: 


LEMMA 3: Let $P^{(k)} = [p_{ij}^{(k)}]$ represent an imbedded Markov chain operator of
a semi-Markov chain $E_k$. Consider the system of equations below:

$$\phi_{kr}^{(i)}(m) - \sum_{j \in E_k} p_{ij}^{(k)} \, \phi_{kr}^{(j)}(m) = 0$$

The solution of this system of equations is independent of the
superscript, that is:

$$\phi_{kr}^{(i)}(m) = \phi_{kr}(m) \quad \text{for all } i \in E_k$$

if and only if the imbedded Markov chain transition operator $P^{(k)}$ has at
most a single unit magnitude eigenvalue.

Proof: This lemma is proved in [14].

Thus,  any  ergodic  imbedded  Markov  chain  operator  (for  which  all 
eigenvalues  have  less  than  unit  magnitude)  will  satisfy  Lemma  3.  In 
addition,  any  monodesmic  imbedded  Markov  chain  operator  (one  that  has  only 
one  trapping  or  absorbing  state,  and  hence  a  single  unit  magnitude 
eigenvalue)  will  also  satisfy  Lemma  3.  This  assertion  is  similar  to  Theorem 
2  of  section  2.3  for  continuous  time  models. 

We shall now state a theorem that describes how a semi-Markov chain
which is dependent on a small parameter ε can be approximately described by
a Markov chain. This theorem is derived based on the results for
semi-Markov processes in [8]. The semi-Markov chains here are assumed to depend
on a small parameter ε such that the state space can be decomposed into
disjoint classes of states where the probabilities of departure from each
class tend to zero along with ε. In addition, the total sojourn in each
class is assumed to have a non-degenerate distribution in the limit as ε → 0.

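THEOREM 4: A Limit Theorem for Semi-Markov Chains

Let the set E of states of the semi-Markov chain be expressible as a
union of disjoint classes

$$E = \bigcup_{k=1}^{N_e} E_k, \qquad k \in M = \{1, 2, \ldots, N_e\}$$

Let $\tau_{kr}^{(i)}$ be the sojourn of the semi-Markov chain in class $E_k$ when it
starts from state $i \in E_k$ and moves to class $E_r$. The following two sets
of conditions are assumed to hold:

1. The elements of the core matrix sequence $\{g_{ij}^{\varepsilon}(m) \mid i,j \in E\}$ specifying
the semi-Markov chain depend as follows on the small parameter ε:

$$g_{ij}^{\varepsilon}(m) = p_{ij}^{\varepsilon} \, h_{ij}(m/\varepsilon)$$

where $h_{ij}(0) = 0$. The $p_{ij}^{\varepsilon}$ may be expanded in a Taylor series
about ε = 0. Retaining only linear terms in ε in explicit form:

$$p_{ij}^{\varepsilon} = \begin{cases} p_{ij}^{(k)} - \varepsilon \, q_{ij}^{(k)} + \ldots + o(\varepsilon), & i,j \in E_k \\ \varepsilon \, q_{ij}^{(k)} + \ldots + o(\varepsilon), & i \in E_k, \; j \notin E_k \end{cases}$$

where $o(\varepsilon)$ represents terms such that $\lim_{\varepsilon \to 0} o(\varepsilon)/\varepsilon = 0$.

The imbedded Markov chain obeys the usual Markov chain properties:

$$\sum_{j \in E_k} p_{ij}^{(k)} = 1; \quad \text{and} \quad p_{ij}^{(k)} \in [0,1];$$

for all $i,j \in E_k$ and for all $k \in M$

2. The imbedded Markov chains defined by the transition probability
matrices $\{p_{ij}^{(k)} \mid i,j \in E_k \text{ for each } k \in M\}$ are ergodic with stationary
distributions $\{\pi_i^{(k)} \mid i \in E_k, \; k \in M\}$.

Then:

$$\lim_{\varepsilon \to 0} \Pr(\tau_{kr} \le t) = \gamma_{kr} \left[ 1 - \exp(-\Lambda_k t / T) \right]$$

where:

$$\gamma_{kr} = \frac{\sum_{i \in E_k} \pi_i^{(k)} q_i^{(kr)}}{\sum_{i \in E_k} \pi_i^{(k)} q_i^{(k)}}, \qquad \Lambda_k = \frac{\sum_{i \in E_k} \pi_i^{(k)} q_i^{(k)}}{\sum_{i \in E_k} \pi_i^{(k)} a_i^{(k)}}$$

Here:

$$q_i^{(kr)} = \sum_{j \in E_r} q_{ij}^{(k)}, \qquad q_i^{(k)} = \sum_{j \notin E_k} q_{ij}^{(k)}, \qquad a_i^{(k)} = \sum_{j \in E_k} p_{ij}^{(k)} a_{ij}, \qquad a_{ij} = \sum_{m} m \, h_{ij}(m)$$

Proof: The proof of this theorem appears in [14] and comprises most of
section 2 of [15].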
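
The parameters of the limiting sojourn distribution can be evaluated with a few lines of code. The sketch below is illustrative only. It assumes that the interclass probability is the stationary-weighted ratio gamma_kr = sum_i pi_i q_i^(kr) / sum_i pi_i q_i^(k), that the rate is Lambda_k = sum_i pi_i q_i^(k) / sum_i pi_i a_i^(k), and that q_i^(k) is the sum of q_i^(kr) over all destination classes r. The function names and all numerical values are hypothetical.

```python
import math


def interclass_parameters(pi_k, q_out, a_k):
    """Return (Lambda_k, {r: gamma_kr}) for one class k.

    pi_k  : stationary probabilities of the states in class k
    q_out : q_out[i][r] is q_i^(kr), the first-order coefficient for
            leaving state i of class k toward class r
    a_k   : a_k[i] is the mean holding time a_i^(k)
    """
    q_total = [sum(row.values()) for row in q_out]            # q_i^(k)
    leave = sum(p * q for p, q in zip(pi_k, q_total))
    hold = sum(p * a for p, a in zip(pi_k, a_k))
    targets = sorted({r for row in q_out for r in row})
    gamma = {r: sum(p * row.get(r, 0.0) for p, row in zip(pi_k, q_out)) / leave
             for r in targets}
    return leave / hold, gamma


def sojourn_cdf(t, lam, gamma_r, time_step=1.0, alpha=1.0):
    """Limiting form gamma_kr * (1 - exp(-Lambda_k * t / (alpha * T)))."""
    return gamma_r * (1.0 - math.exp(-lam * t / (alpha * time_step)))


# Hypothetical class with two states and two destination classes (2 and 3).
PI_K = [0.25, 0.75]
Q_OUT = [{2: 1.0}, {2: 1.0, 3: 2.0}]
A_K = [2.0, 4.0]

LAM, GAMMA = interclass_parameters(PI_K, Q_OUT, A_K)
CDF_AT_ZERO = sojourn_cdf(0.0, LAM, GAMMA[2])
```

Setting alpha to 1 corresponds to the form without time scaling; a value of alpha other than 1 gives the time-scaled variant of the same expression.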

Although the above theorem is useful, it is not directly applicable to
most fault tolerant system models for two reasons: (1) the imbedded Markov
chains for such models are usually non-ergodic (as has been stated on
numerous occasions in this report already), and (2) the holding time density
functions are usually not dependent on m/ε but only on m. Hence, two
necessary  adjustments  must  be  made  in  the  above  theorem.  The  first  of  these 
is  to  determine  what  conditions  must  be  satisfied  by  the  imbedded  Markov 
chain  in  order  for  the  results  to  be  valid.  This  leads  to  Lemma  3,  which 
has  already  been  stated.  The  second  of  these  adjustments  is  to  incorporate 
time  scaling  into  Theorem  4  in  a  manner  analogous  to  the  time-scaled  limit 
theorem  of  section  2.2.  The  following  theorem  results  from  this 
consideration: 


THEOREM 5: A Limit Theorem With Time Scaling for Semi-Markov Chains

Let the set E of states of a semi-Markov chain be expressible as a
union of disjoint classes

$$E = \bigcup_{k=1}^{N_e} E_k, \qquad k \in M = \{1, 2, \ldots, N_e\}$$

Let $\tau_{kr}^{(i)}$ be the sojourn of the semi-Markov chain in class $E_k$ when it
starts from state $i \in E_k$ and moves to class $E_r$. Let the following two
sets of conditions hold for the semi-Markov chain:

1. The elements of the core matrix sequence $\{g_{ij}^{\varepsilon}(m) \mid i,j \in E\}$ specifying
the semi-Markov chain depend as follows on the small parameter δ:

$$g_{ij}^{\varepsilon}(m) = p_{ij}^{\varepsilon} \, h_{ij}(m/\delta)$$

where $h_{ij}(0) = 0$. The $p_{ij}^{\varepsilon}$ may be expanded in a Taylor series
about ε = 0 as:

$$p_{ij}^{\varepsilon} = \begin{cases} p_{ij}^{(k)} - \varepsilon \, q_{ij}^{(k)} + \ldots + o(\varepsilon), & i,j \in E_k \\ \varepsilon \, q_{ij}^{(k)} + \ldots + o(\varepsilon), & i \in E_k, \; j \notin E_k \end{cases}$$

The imbedded Markov chain obeys the usual Markov chain properties:

$$\sum_{j \in E_k} p_{ij}^{(k)} = 1; \quad \text{and} \quad p_{ij}^{(k)} \in [0,1] \quad \text{for all } i,j \in E_k \text{ and all } k \in M$$

and

2. The imbedded Markov chains defined by the transition probability
matrices $\{p_{ij}^{(k)} \mid i,j \in E_k, \; k \in M\}$ have at most a single unit magnitude
eigenvalue (hence, ergodic or monodesmic) with stationary
distribution $\{\pi_i^{(k)} \mid i \in E_k, \; k \in M\}$.

Then:

$$\lim_{\varepsilon \to 0} \Pr(\tau_{kr} \le t) = \gamma_{kr} \left[ 1 - \exp(-\Lambda_k t / \alpha T) \right]$$

where $\gamma_{kr}$, $\Lambda_k$, $q_i^{(kr)}$, $q_i^{(k)}$, and $a_i^{(k)}$ were all defined in Theorem 4 and

$$\alpha = \frac{\delta}{\varepsilon}$$

Proof: This theorem is proved in [14]. The steps in the proof are
nearly identical to those of the proof of Theorem 4.

Once  these  results  had  been  derived,  we  turned  our  attention  to 
demonstrating  these  results  on  some  models  for  fault  tolerant  systems  that 
were  simple  enough  to  yield  analytical  results  but  still  illustrative  of  the 
behavior of fault tolerant system models discussed in the preceding
sections. The next section summarizes these efforts.


The  results  of  Theorem  5  were  applied  to  three  realistic  examples  of 
fault  tolerant  systems  for  which  semi -Markov  chain  reliability  models  can  be 
derived.  Two  rather  simple  fault  tolerant  systems  were  considered.  The 
first  of  these  systems  is  a  single  component  monitoring  system  (SCMS) .  A 
single  non-essential  component  is  monitored  by  a  sequential  test  to  indicate 
any  detected  faults  for  the  information  of  the  pilot.  This  produces  a  3- 
state  semi -Markov  reliability  model  that  can  be  decomposed  into  two  classes. 
The  nature  of  these  classes  depends  upon  the  assumptions  that  are  made 
regarding  the  monitoring  strategy.  The  second  system  that  was  considered  is 
a  single -component  dual  redundant  (SCDR)  system.  This  system  consists  of 
two  identical  components  operating  in  parallel  with  a  RM  strategy  to  detect 
and  identify  failures  and  to  select  the  appropriate  component  for  use. 
Under  a  particular  set  of  assumptions  regarding  the  RM  strategy,  the  model 
for  the  SCDR  has  eight  states  and  three  non-ergodic  classes. 

The  results  for  these  two  systems  will  be  summarized  below. 

2.5.1  SCMS with Continuous Monitoring (SCMS-I)

The results for the SCMS-I model are summarized in sections 5.1 and 5.2
of [15], which is also attached to this report as an Appendix. It should be
noted that the reliability model for SCMS-I is simple enough to be solved
analytically by z-transform methods. Therefore, an exact analytical answer
is available with which to compare the results generated by the
approximation implied by Theorem 5. Most reliability models do not have
this property; the SCMS-I system was chosen for examination in part
because of it.


The  basic  assumptions  made  in  constructing  the  reliability  model  for 
SCMS-I  were  that  a  sequential  monitoring  test  is  continuously  applied  to  the 
component  and  that  this  test  has  second  order  hypergeometrically  distributed 
times  to  decision  under  both  normal  and  failed  conditions.  The  second  order 
hypergeometric  assumption  makes  possible  the  analytical  solution  discussed 
in  the  preceding  paragraph.  Furthermore,  the  second  order  hypergeometric 
distribution  is  a  reasonable  approximation  to  the  decision  time  distribution 
that  is  observed  in  practice  for  sequential  monitoring  tests.  It  should 
also  be  noted  that  in  the  construction  of  the  SCMS-I  model,  it  is  assumed 
that  a  value  is  known  for  the  probability  of  an  eventual  false  alarm 
decision  when  no  failure  is  present.  In  practice,  this  probability  would 
probably  not  be  known  and  the  elements  of  the  core  matrix  would  have  to  be 
constructed  directly  from  the  probability  of  a  false  alarm  decision  as  a 
function  of  the  elapsed  time,  which  could  be  found  by  simulating  the 
monitoring  test  with  no  failure  present. 

Section  5.1  of  [15]  shows  how  the  model  is  decomposed  and  how  the 
approximate  results  are  calculated.  The  section  concludes  with  the 
approximate  expression  for  the  state  occupancy  probabilities  that  are  valid 
for  large  values  of  t: 


$$\hat{\pi} = \left[ \, (1 - P_{fa}) \, e^{-\varepsilon \Lambda_1 t/T}, \;\; P_{fa} \, e^{-\varepsilon \Lambda_1 t/T}, \;\; 1 - e^{-\varepsilon \Lambda_1 t/T} \, \right]$$

where ε is the (small) failure probability during each time step, T is the
time step, and $\Lambda_1$ is determined in terms of the parameters of the decision
time distributions and other model quantities and represents the interclass
transition rate.
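
As a quick illustration of how such an expression would be evaluated, the sketch below computes the three approximate occupancy probabilities at a given time. It assumes the approximation reads [(1 - P_fa) e^(-eps*Lambda_1*t/T), P_fa e^(-eps*Lambda_1*t/T), 1 - e^(-eps*Lambda_1*t/T)]. The function name and every numerical value are hypothetical and are not taken from [15].

```python
import math


def scms1_occupancy(t, eps, lam1, p_fa, time_step=1.0):
    """Approximate SCMS-I state occupancy probabilities for large t.

    t         : elapsed time
    eps       : failure probability per time step (small)
    lam1      : interclass transition rate parameter Lambda_1
    p_fa      : probability of an eventual false alarm decision
    time_step : length T of one time step
    """
    decay = math.exp(-eps * lam1 * t / time_step)
    return [(1.0 - p_fa) * decay, p_fa * decay, 1.0 - decay]


# Hypothetical parameter values chosen only to exercise the formula.
EPS = 1.0e-4
LAM1 = 1.0
P_FA = 0.05
OCCUPANCY = scms1_occupancy(t=1000.0, eps=EPS, lam1=LAM1, p_fa=P_FA)
```

The first two entries split the probability of remaining in the unfailed class in the ratio (1 - P_fa) : P_fa, and the three entries always sum to one.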

As section 5.2 of [15] shows, the results produced by the approximation
are very accurate after an initial transient period that is a small fraction
of the mean time between failures of the components (which is essentially
proportional to 1/ε). For the numerical values of the parameters that were
selected, the error in each of the elements of $\hat{\pi}$ is less than 5% after
1/500 of the MTBF has elapsed. Section 5.2 of [15] also shows that values
of ε up to approximately 0.001 produce good agreement between the
approximation  and  the  exact  answer.  Finally,  it  is  shown  in  Section  5.2  of 
[15]  that  the  class  probability  predicted  by  the  enlarged  Markov  process  is 
a  first  order  approximation  in  e  to  the  exact  answer  for  this  example  and 
that  the  approximation  also  includes  the  dominant  second  order  term, 
although  other  second  order  terms  are  not  accounted  for. 

The  excellent  results  for  SCMS-I  are  not  surprising  because  the 
reliability  model  for  this  system  obeys  the  conditions  for  Korolyuk's 
original  limit  theorem  with  one  exception,  namely  the  condition  requiring 
that the holding time distributions compress as the small parameter ε
vanishes.  The  classes  of  the  decomposed  model  are  ergodic  in  this  case,  so 
only  time  scaling  need  be  used  to  generate  the  excellent  results  that  were 
obtained. 

2.5.2  SCMS  with  Abbreviated  Monitoring  (SCMS-II) 

If the sequential test that is used for monitoring in the SCMS is
terminated upon its first indication of a failure, then the resulting
reliability model is slightly different from the SCMS-I model. In
particular, one of the two classes of the decomposed unperturbed model is
nonergodic. This model is described in section 5.3 of [15]. The same
assumptions regarding the decision time distributions and the existence of a
value for the eventual false alarm probability are made as in the SCMS-I
model. After executing the steps of decomposing the model and calculating
the parameters of the approximating Markov process, the result for
$\hat{\pi}$ is:


$$\hat{\pi} = \left[\, 0,\;\; e^{-\epsilon t/T},\;\; 1 - e^{-\epsilon t/T} \,\right]$$


The SCMS-II model can also be solved analytically using the same z-transform
procedure that was used for SCMS-I.

As section 5.4 of [15] points out, the errors in the approximation to
$\pi_2$ and $\pi_3$ are extremely small (<0.01%) for all values of t.
However, the error in the approximation to $\pi_1$ is always 100% because
the approximation is exactly zero. This results from the stationary
distribution that yields the value of zero for $\hat{\pi}_1$.

Analytically  comparing  the  class  probability  results  generated  by  the 
approximation  and  the  analytical  solution  shows  that  the  SCMS-II  class 
probabilities are accurate through the dominant second order term in $\epsilon$.
This  is  the  same  result  that  was  observed  for  the  SCMS-I  model. 


2.5.3  SCDR  System 

The  SCDR  system  is  the  simplest  redundant  system  that  can  be 
constructed.  It  was  chosen  for  analysis  in  the  hope  that  exact  results 
could  be  obtained  for  it  by  the  same  method  that  was  applied  to  the  SCMS 
systems  discussed  above.  The  SCDR  system  was  assumed  to  have  an  independent 
sequential  test  monitoring  each  of  the  components  where  the  tests  are  both 
reset once either of them has reached a no-failure decision (which is
commonly done in practice to minimize the number of false alarms). Also, it
was assumed that the system can survive for short periods of time (in this
case,  one  RM  time  interval)  while  using  a  failed  component,  but  not  for 
extended  periods  of  time.  The  RM  strategy  is  to  use  one  of  the  components 
(designated  the  primary  component)  until  its  test  indicates  that  it  is 
failed,  then  switch  to  the  other  (backup)  component  unless  it  is  also 
indicated  as  failed,  in  which  case  use  of  the  primary  component  continues 
despite  the  failure  indication.  Both  monitoring  tests  are  discontinued  once 
a  failure  indication  by  either  has  occurred. 

The  details  of  the  analysis  of  the  SCDR  system  are  presented  in  [14]. 
In  this  report,  the  results  will  be  briefly  summarized. 

If  the  usual  techniques  are  used  for  constructing  the  semi-Markov 
reliability  model  for  the  SCDR  system,  the  result  is  a  six-state  system 
referred to as the SCDR-ASL model in [14]. This model does not decompose in
the  manner  required  for  application  of  the  limit  theorems  discussed  in  the 
preceding  section.  The  reason  for  this  is  that  the  standard  reliability 
model  construction  technique  designates  a  single  aggregated  state  as  the 
system  loss  state.  The  SCDR  system  with  the  RM  strategy  outlined  above  has 
several  routes  by  which  system  loss  can  occur  following  the  first  failure. 
Some  of  these  routes  (like  repeated  missed  detections  of  the  failed 
component)  occur  in  the  "fast"  time  scale  associated  with  the  RM  decisions. 
At  least  one  (the  occurrence  of  a  second  failure  following  correct 
reconfiguration)  occurs  in  the  "slow"  time  scale  and  therefore  contributes  a 
transition probability that is proportional to the small parameter $\epsilon$. The
result  is  that  the  aggregated  system  loss  state  communicates  by  both  fast 
and  slow  transitions  with  states  in  a  single  class.  This  violates  the 
conditions  necessary  to  decompose  the  model  properly. 


This  difficulty  is  rather  easily  overcome.  By  decomposing  the 
aggregated  system  loss  state  into  three  states  with  each  reflecting  a 
particular  route  to  system  loss,  an  eight-state  model  referred  to  in  [14]  as 
the  SCDR-TS  model  is  produced.  This  model  decomposes  into  three  classes  in 
the  manner  prescribed  by  the  conditions  of  the  limit  theorems.  However,  one 
of  these  classes  includes  two  distinct  trapping  states.  This  violates  the 
sufficient  conditions  for  both  of  the  discrete  time  limit  theorems  stated  in 
the  preceding  section.  This  means  that  the  results  of  these  theorems  do  not 
necessarily  apply  to  this  model. 

Recall  that  a  weakened  sufficient  condition  for  the  application  of 
Korolyuk's  limit  theorem  was  discussed  in  section  2.1.  Let  us  check  this 
condition  for  this  model.  Referring  to  the  description  of  the  SCDR  model  in 
[14],  we  find  that  the  imbedded  Markov  process  transition  probability  matrix 
for Class 2 is:

where $P_{fa}$ is the eventual probability of a false failure indication by
one of the tests, $P_m$ is the probability of an eventual missed detection
by one of the tests, and the two tests are assumed to be identical. In
practice, of course, $P_{fa}$ and $P_m$ would not necessarily be known
explicitly but rather would be implicit in the statistical description of
the decision behavior of the tests. For this analysis, however, let us
assume that these values are known.

Despite the presence of several zero values in $P_2$, it is quite
difficult to obtain an analytical solution for the steady state operator
$\lim_{k \to \infty} P_2^k$ in terms of the variables $P_{fa}$ and $P_m$.
Doing so requires z-transform analysis that includes factoring a fifth
order polynomial symbolically. However, when numerical values are
substituted for $P_{fa}$ and $P_m$, it is relatively easy to find the
limiting transition operator and thence to check for the existence of the
inverse of the operator $I - P_2 + \Pi_2$. In particular, when
$P_{fa} = 0.05$ and $P_m = 0.1$, we find that the determinant of
$I - P_2 + \Pi_2$ is 0.0049. Therefore, the relaxed sufficient condition is
satisfied and Korolyuk's limit theorem results apply. However, for these
values (and for several other sets of reasonable values that were tried),
the condition number of $I - P_2 + \Pi_2$ is quite large (nearly 500 for
the values cited above, much larger for smaller values of the eventual
transition probabilities). This implies that numerical errors are likely in
evaluating some of the quantities that are used to describe the enlarged
Markov process that approximates the interclass behavior of the perturbed
process.
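
The numerical check described above can be sketched as follows. This is a
minimal illustration under stated assumptions: the 3-state matrix P_EXAMPLE is
hypothetical (it is not the Class 2 matrix of the SCDR-TS model), the limiting
operator is taken to be the plain limit of the powers of P, which is assumed
to exist, and all function names are invented for this sketch.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def limiting_operator(p, squarings=60):
    """Approximate lim P^k by repeated squaring (assumes the limit exists)."""
    out = [row[:] for row in p]
    for _ in range(squarings):
        out = mat_mul(out, out)
    return out


def determinant(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-15:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def relaxed_condition_matrix(p):
    """Form I - P + Pi, where Pi is the limiting operator of P."""
    n = len(p)
    pi = limiting_operator(p)
    return [[(1.0 if i == j else 0.0) - p[i][j] + pi[i][j] for j in range(n)]
            for i in range(n)]


# Hypothetical chain: state 0 is transient, states 1 and 2 are trapping.
P_EXAMPLE = [[0.5, 0.3, 0.2],
             [0.0, 1.0, 0.0],
             [0.0, 0.0, 1.0]]
DET_EXAMPLE = determinant(relaxed_condition_matrix(P_EXAMPLE))
print(DET_EXAMPLE)
```

A determinant that is nonzero but small relative to the matrix entries is the
situation described in the text: the inverse exists, but quantities computed
from it may be sensitive to rounding.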

Another problem with this model is that a unique stationary
distribution for Class 2 of the unperturbed process does not exist in
general. For instance, for the values cited above for $P_{fa}$ and $P_m$,
it is known only that the stationary distribution of the unperturbed
process in Class 2 is a linear combination of the distributions
[0 0 0 1 0] and [0 0 0 0 1] where the weights in the linear combination
depend upon the initial condition for this unperturbed process. This means
that the enlarged process cannot be expanded in terms of a stationary
distribution to approximate the distribution of the original unperturbed
process.

The fact that the SCDR-TS model could not be analyzed using the
approximate  technique  motivated  us  to  pursue  further  the  problem  of  models 
with  multiple  trapping  states  in  the  classes  of  the  unperturbed  model.  The 
results  of  that  effort  are  reported  in  the  next  section. 

Returning to the analysis of the SCDR system, it is possible to
construct a reliability model for this system that satisfies the sufficient
conditions for Theorem 5 of the preceding section. This is accomplished in
section 4.4 of [14] and produces the model referred to there as SCDR-I.
Unfortunately, the construction of SCDR-I involves the merging of two of the
states in the SCDR-TS model where one of these states represents a working
system and the other represents a failed system. Therefore, although the
SCDR-I model possesses the mathematical properties that are necessary for
the approximate reliability evaluation technique to be applied to it, the
results are of questionable value because they cannot be used to generate
the system reliability. Nevertheless, the analysis of the SCDR-I model was
carried out in an effort to expand our insight on the use of the approximate
technique.

Under  the  assumption  that  the  times  to  decision  for  each  of  the 
monitoring  tests  of  the  SCDR  system  are  distributed  according  to  a  second 
order  hypergeometric  distribution,  the  results  of  approximate  analysis  of 
the  SCDR-I  model  are  described  in  [14].  The  numerical  results  indicate 
excellent  approximation  to  the  behavior  of  the  perturbed  model  by  the 
enlarged process even for relatively large values of $\epsilon$ (on the order of
0.05). Furthermore, it is shown in [14] that the approximate technique
applied to the SCDR-I model yields values for the class occupancy
probabilities that agree to first order with the exact answer and also
include the dominant second order term in the exact answer. This is
significant because it is the same as the results obtained for the SCMS
models  discussed  above  despite  major  differences  between  the  SCMS  models  and 
SCDR-I. 

The  SCDR-I  model  is  sufficiently  complicated  that  the  powerful  symbolic 
manipulation  language  MACSYMA  could  not  generate  analytical  solutions  for 
the  individual  state  occupancy  probabilities,  even  when  the  variables  in  the 
system  were  replaced  by  numerical  values.  Only  the  class  occupancy 
probabilities  could  be  found  analytically.  Thus,  all  of  the  numerical 
results  discussed  in  [14]  for  the  SCDR-I  model  are  limited  to  class 
occupancy  results.  The  difficulty  encountered  in  generating  an  exact 
solution  to  the  reliability  model  for  a  system  as  simple  as  the  SCDR  system 
with  very  simple  holding  time  assumptions  (second  order  hypergeometric)  puts 
added  focus  on  two  critical  aspects  of  this  research.  First,  the  need  for 
approximate  techniques  is  clearly  apparent  when  even  the  simple  SCDR  system 
cannot  be  analyzed  by  exact  techniques.  Second,  the  fact  that  exact 
solutions  for  simple  systems  cannot  be  derived  motivates  research  on 
techniques,  perhaps  computationally  intensive  techniques,  for  generating  the 
baseline  solutions  with  which  approximate  results  will  be  compared  to 
determine  the  validity  of  the  approximations. 

2.6 Extension to Models with Multiple Trapping States in the Classes

The  SCDR-TS  model  and  some  of  the  continuous  time  models  that  do  not 
satisfy  the  conditions  of  Theorem  2  in  section  2.3  illustrate  a  type  of 
model  for  which  the  results  discussed  to  this  point  are  not  applicable. 
These  models  include  multiple  trapping  states  in  some  of  the  classes  of  the 
decomposed unperturbed model. When this occurs, the transition operator
for some of the classes can have multiple eigenvalues of unity and the
Cesaro limit defined in the sufficient conditions of the theorems does not
in general have the necessary form.

In  the  fault  tolerant  system  context,  these  types  of  models  occur  when 
the  results  of  RM  decisions  without  additional  failures  can  yield  system 
configurations  with  widely  different  performance  interpretations.  For 
instance,  in  the  case  of  the  SCDR  system,  a  detected  failure  results  in  a 
working  system  while  a  false  alarm  combined  with  failure  of  the  other 
component  results  in  a  failed  system.  Both  of  the  states  representing  these 
situations  are  in  the  same  class  of  the  model  and,  because  the  RM  tests  are 
terminated  upon  any  failure  indication,  are  trapping  states  for  this  class 
when  the  failure  rate  is  set  to  zero. 

In  this  section,  we  will  present  an  approach  that  extends  the  limit 
theorem  results  of  the  previous  sections  to  this  case.  The  results  are 
preliminary  because  this  work  had  just  begun  shortly  before  the  termination 
of  the  grant.  We  limit  the  discussion  here  to  continuous  time  models. 
Discrete  time  models  remain  to  be  investigated. 

Suppose that a continuous time perturbed semi-Markov model is obtained
for the behavior of a fault tolerant system such that it decomposes into
classes and can be time-scaled as in section 2.2. Suppose also that at
least one of the classes, say $E_r$, yields an unperturbed imbedded Markov
process transition operator $P_r$ that corresponds to a process that has
transient states, labelled $s_1$, $s_2$, and so on, and multiple trapping
sets, where a trapping set can consist either of a single state or a closed
communicating class of recurrent states. The procedure for applying the
results of the time-scaled limit theorem to this case is as follows.

Let the trapping sets in $E_r$ be labelled $R_1$, $R_2$, and so on and
let $S$ be the set of transient states in $E_r$. For each state
$s_i \in S$, calculate the probability that a transition is eventually made
to trapping set $R_j$ for each $j$. This can be accomplished using only the
information provided by $P_r$ as follows.

Define an initial condition on the unperturbed process whereby the
probability of occupying state $s_i$ is unity. The unperturbed imbedded
Markov process transition operator $P_r$ can then be used to calculate the
probability $p_{s_i r_j}$ that each of the trapping sets is the one
eventually occupied when the process starts in state $s_i$ by successively
operating on this initial condition until the probability that any of the
states in $S$ is occupied vanishes and then adding together the
probabilities of occupying the states within that trapping set when steady
state is reached. Since the $r$th class contains only a subset of the states
of the overall model, the dimension of $P_r$ can be small enough that the
problem of finding the steady state trapping set probabilities can be solved
by transform methods.
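
The successive-operation procedure just described can be sketched as follows.
This is a minimal illustration rather than the authors' implementation: the
5-state matrix P_R, the state labelling, and the function name are all
hypothetical. The iteration is used here in place of the transform methods
mentioned above.

```python
def trapping_set_probabilities(p, start, transient, trapping_sets,
                               tol=1e-14, max_steps=100000):
    """Probability of eventually occupying each trapping set from `start`.

    p             -- transition matrix of one class (list of rows)
    start         -- index of the initial transient state s_i
    transient     -- indices of the transient states S
    trapping_sets -- list of index lists, one per trapping set R_j
    """
    n = len(p)
    dist = [0.0] * n
    dist[start] = 1.0  # unit probability of occupying s_i initially
    for _ in range(max_steps):
        if sum(dist[s] for s in transient) < tol:
            break
        # One application of the transition operator to the row vector.
        dist = [sum(dist[i] * p[i][j] for i in range(n)) for j in range(n)]
    else:
        raise RuntimeError("transient probability mass did not vanish")
    return [sum(dist[s] for s in members) for members in trapping_sets]


# Hypothetical class: states 0 and 1 are transient, state 2 is a trapping
# state, and states 3 and 4 form a closed communicating class.
P_R = [[0.2, 0.3, 0.4, 0.1, 0.0],
       [0.1, 0.1, 0.2, 0.0, 0.6],
       [0.0, 0.0, 1.0, 0.0, 0.0],
       [0.0, 0.0, 0.0, 0.5, 0.5],
       [0.0, 0.0, 0.0, 0.3, 0.7]]
FROM_S0 = trapping_set_probabilities(P_R, 0, [0, 1], [[2], [3, 4]])
FROM_S1 = trapping_set_probabilities(P_R, 1, [0, 1], [[2], [3, 4]])
print(FROM_S0, FROM_S1)
```

Summing over the members of a trapping set means that a closed communicating
class is treated exactly like a single trapping state, which matches the
definition of a trapping set given earlier.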

Once these probabilities are determined, we decompose class r into as
many subdivisions as there are trapping sets. Each of these subdivisions
becomes a new class in the modified decomposed model. Let these classes be
labelled $E_{r_1}$, $E_{r_2}$, and so forth. As before, let
$\tau^i_{k r_j}$ be the sojourn time from state $i \in E_k$ to class
$E_{r_j}$ with $E_{r_j} \subset E_r \neq E_k$, and let $\phi^i_{k r_j}(s)$
be its characteristic function. Suppose that $P_k$ has the Cesaro limit
property that was the condition for Theorem 1 of section 2.3 (and recall
that Theorem 2 showed that ergodicity of $P_k$ or the existence of one and
only one eigenvalue of unity for $P_k$ were sufficient for this property).
We then have that, in scaled (i.e. slow) time, as $\epsilon$ approaches
zero:


$$\begin{aligned}
\phi^i_{k r_j}(s) \;\to\;& \tilde{\phi}_{k r_j}(s) + \sum_i p_{s_i r_j}\, \phi_{k s_i}(s) \\
=\;& p_{k r_j}\, \frac{\Lambda_k}{\Lambda_k + s} + \sum_i p_{s_i r_j}\, p_{k s_i}\, \frac{\Lambda_k}{\Lambda_k + s}
\end{aligned}$$

where the summation is over all i for which $s_i \in S$,
$\tilde{\phi}_{k r_j}(s)$ is the direct transition limiting sojourn
characteristic function, and $\phi_{k s_i}(s)$ is the limiting sojourn
characteristic function from class k to state $s_i \in S$. The parameters
$p_{k r_j}$, $p_{k s_i}$, and $\Lambda_k$ in the second line are determined
in exactly the same fashion as the parameters $p_{kr}$ and $\Lambda_k$ in
the continuous extension of Korolyuk's limit theorem stated in section 2.3
above. Note that this result is independent of the starting state in class
k, i.e. the result is independent of the superscript.
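
To make the bookkeeping in this result concrete, the sketch below combines the
direct and indirect routes from class k to each new class into a single
branching probability. It assumes that every limiting characteristic function
involved shares a common factor depending only on class k, so that only the
probability weights differ; the function name and the numbers are
hypothetical.

```python
def effective_branching(p_direct, p_to_transient, p_trap):
    """Effective probability of entering each trapping-set class from class k.

    p_direct[j]       -- probability of a direct transition to class r_j
    p_to_transient[i] -- probability of a transition to transient state s_i
    p_trap[i][j]      -- probability that s_i is eventually trapped in R_j
    """
    n_sets = len(p_direct)
    n_transient = len(p_to_transient)
    return [p_direct[j]
            + sum(p_to_transient[i] * p_trap[i][j] for i in range(n_transient))
            for j in range(n_sets)]


# Hypothetical values: two trapping sets, two transient states.
BRANCHING = effective_branching(
    p_direct=[0.1, 0.2],
    p_to_transient=[0.3, 0.4],
    p_trap=[[0.6, 0.4], [0.25, 0.75]],
)
print(BRANCHING)
```

When the direct and transient-state probabilities together sum to one and
each row of trapping probabilities sums to one, the effective branching
probabilities also sum to one, as they must for the enlarged process.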

This  result  implies  that  an  enlarged  Markov  process  can  be  defined  that 
approximates  the  transition  behavior  of  the  perturbed  process  among  the 
trapping  sets  of  the  classes  in  scaled  time.  As  before,  the  approximate 
state  occupancy  probabilities  can  then  be  found  by  expanding  the  class 
probabilities  using  the  steady  state  distribution  of  the  unperturbed  process 
within  each  class,  if  the  steady  state  distribution  exists. 
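
The expansion step mentioned above amounts to weighting each class
probability by the steady state distribution within that class. A minimal
sketch, assuming such a distribution exists for every class; the function
name and the numbers are hypothetical:

```python
def expand_class_probabilities(class_probs, within_class_dists):
    """Approximate state occupancy probabilities from class probabilities.

    class_probs[c]        -- probability of occupying class c (enlarged process)
    within_class_dists[c] -- steady state distribution over the states of
                             class c for the unperturbed process
    """
    state_probs = []
    for prob, dist in zip(class_probs, within_class_dists):
        state_probs.extend(prob * weight for weight in dist)
    return state_probs


# Hypothetical two-class model: class 0 has two states, class 1 has one.
STATE_PROBS = expand_class_probabilities([0.9, 0.1], [[0.95, 0.05], [1.0]])
print(STATE_PROBS)
```
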

The proof of this result is still incomplete. However, it will be
outlined here.


There are essentially three steps to the proof. The first is to show
that the characteristic function of the sojourn from any state in $E_k$ to
any state in $S$ has the form $p_{k s_i}\, \Lambda_k / (\Lambda_k + s)$.
Next is to show that the sojourn from any state in $E_k$ directly (i.e.
without occupying any of the states in $S$ in the process) to any state in
$E_{r_j}$ also has this form with $s_i$ replaced by $r_j$. Finally, we must
show that the linear combination of these terms in the final result produces
the characteristic function of the sojourn from any state in $E_k$ to any
state in $E_{r_j}$. The proof outline will describe how this can be done
when $S$ comprises only one state. The extension to multiple transient
states is straightforward, but had not been completed at this writing.

Since $S$ and $E_{r_j}$ are subsets of $E_r$, the first two steps in the
proof are already complete by applying the results of Theorem 1 in section
2.3. This leaves only the third step. With $\tau^i_{k r_j}$ defined as above
and with $\tilde{\tau}^i_{k r_j}$ representing the analogous sojourn for
transitions directly from states in $E_k$ to states in $E_{r_j}$, we have
that when $S$ is the only transient state in $E_r$:

f 


where i→n represents that the next transition is from i to n; the first
holding time is that of the transition i→n within class $E_k$, measured in
the "fast" time scale (so that its product with $\delta$ is this holding
time measured in scaled time, where the small parameter $\delta$ represents
the time scaling from "fast" time to the scaled or "slow" time); and the
second is the holding time for transitions from state n to state 1 within
class $E_r$, expressed in "fast" time units (hence multiplication by
$\delta$ scales this to the "slow" time scale). Proceeding as in [8], this
expression can be written in terms of the transition operator for the
perturbed process as:

Prtr^>  St)-  Z  J'  Pri„<»  S  t-u)  dpf  <u)
J  n£\  J 


Taking  the  Laplace  transform  of  the  entire  expression  and  using  the 
properties assumed for the perturbed process, we have:

where:


By moving all of the terms of order one or higher in $\epsilon$ and $\delta$
to the right hand side of the expression for the characteristic function and
moving all of the zero order terms to the left, upon taking the limit as
$\epsilon$ and $\delta$ go to zero we get:


0 




The assumption that class $E_k$ has a unique Cesaro limit sense solution
for the steady state distribution of the unperturbed process within class
$E_k$ implies that the characteristic functions $\phi^i_{k r_j}(s)$,
$\tilde{\phi}^i_{k r_j}(s)$, and $\phi^i_{k S}(s)$ are all independent of i.
This yields:

for which the solution is obviously:


$$\phi_{k r_j}(s) = \tilde{\phi}_{k r_j}(s) + p_{S r_j}\, \phi_{k S}(s)$$


This essentially completes the proof for the case of a single transient
state $S$ in class $E_r$. We are hoping to extend this proof technique to
the case of multiple transient states in $E_r$.


III. SUMMARY OF SIGNIFICANT FINDINGS AND FUTURE WORK


3.1 Summary of Significant Findings

The  limit  theorems  derived  as  the  result  of  the  research  described 
above  extend  the  enlarged  Markov  process  approximation  to  all  of  the  model 
structures  that  have  been  observed  by  the  authors  in  constructing  semi- 
Markov  models  for  fault  tolerant  control  system  behavior.  This  is  of  great 
practical  significance  because  of  the  tremendous  reduction  in  computational 
overhead  that  is  realized  by  using  the  approximation  relative  to  solving  the 
model  exactly. 

The  key  results  of  the  research  are  the  limit  theorems  cited  above. 
These  include  the  extension  of  Korolyuk's  limit  theorem  to  time-scaled 
models  that  do  not  decompose  into  ergodic  classes  (Theorems  1  and  2  of 
section 2.3) and the discrete time version of these theorems, Theorem 5 of
section 2.4. The final key result is the one stated in section 2.6
regarding  continuous  time  models  that  decompose  into  classes  that  have 
multiple  trapping  sets.  Because  fault  tolerant  system  behavior  often 
involves  situations  where  RM  decisions  alone  (without  failures)  can  lead  to 
more  than  one  outcome,  multiple  trapping  sets  within  a  class  are  a  common 
occurrence  in  models  of  this  behavior.  Therefore,  this  latter  result  is 
crucial  to  extending  the  approximation  to  essentially  all  of  the  fault 
tolerant  system  models  that  one  might  encounter  for  evaluation. 

Another  key  finding  results  from  the  error  analysis  described  in 
section  2.5.  It  was  found  that  for  the  two  models  considered  (the  single 
component monitoring system and the single component dual redundant system)
the class probabilities predicted by the approximation agree with the exact
solution to first order in the small parameter $\epsilon$ and also through the
dominant  second  order  term.  This  implies  that  the  error  in  this 
approximation  is  not  just  second  order  but  in  some  sense  "small"  second 
order  for  these  two  cases.  We  hasten  to  add  that  there  is  no  indication 
that  this  result  is  true  in  general,  but  it  is  rather  interesting  that  it  is 
true  for  two  widely  different  simple  cases. 

In  summary,  the  research  conducted  under  this  grant  has  laid  the 
mathematical  groundwork  for  the  practical  computation  of  approximate 
reliability  predictions  for  a  wide  range  of  fault  tolerant  systems  with 
random dynamics that can be modelled by semi-Markov processes. This is
significant  because  such  models  are  too  complex  to  be  solved  exactly  even 
with  today's  extremely  powerful  computers.  The  approximate  method  applies 
to  all  discrete  or  continuous  time  models  of  fault  tolerant  systems  that  the 
authors  have  encountered  with  one  exception:  discrete  time  models  with 
multiple  trapping  sets  in  the  decomposed  classes.  It  is  believed  that  this 
single exception can be eliminated (see below).

The  availability  of  this  practical  reliability  evaluation  tool  can  have 
profound  impact  on  the  practice  of  designing  fault  tolerant  systems  that  are 
subject  to  random  RM  decision  errors  as  well  as  random  component  failures. 
Informed  design  decisions  can  now  be  made  based  upon  quantitative  system 
performance  results  that  were  too  difficult  to  compute  before  this  method 
was  developed.  The  ease  with  which  results  can  be  computed  also  makes 
possible  iterative  design  schemes  that  use  a  performance  measure  calculated 
by this technique as the design criterion.


3.2 Proposed Future Work


A  proposal  has  been  submitted  to  AFOSR  to  continue  the  funding  of  this 
research.  The  proposed  research  includes  two  tasks  that  follow  directly 
from  the  research  reported  here.  One  of  these  is  to  extend  the  limit 
theorems  to  discrete  time  models  with  multiple  trapping  sets  in  the  classes 
in  order  to  eliminate  the  major  exception  cited  above.  This  should  be  a 
straightforward  task.  The  other  is  to  apply  the  approximate  technique  to 
more  sophisticated  models  than  those  that  have  been  examined  so  far.  As 
part  of  this  task,  an  effort  will  be  undertaken  to  generate  exact  results 
for  these  complex  cases  with  which  to  compare  the  approximate  results  in 
order  to  verify  the  accuracy  of  the  approximation.  Some  innovative 
computational  methods  on  extremely  high  throughput  machines  may  be  required 
to  do  this.  In  addition,  another  task  involves  the  examination  of 
approximating  quantities  other  than  the  system  reliability  and  state 
occupancy probabilities with the approximate method.


Abstract

Problems associated with evaluating the state probability vector of large state
space models of fault-tolerant systems are explained. Korolyuk’s Limit
Theorem on semi-Markov processes leads to a solution to these problems by
approximating an aggregated version of the original semi-Markov process by a
reduced order Markov chain. The Theorem is modified and extended to apply to
fault-tolerant system models in a slow time scale. The approximate technique
is then completed by expanding the approximate Markov chain state
probabilities with the limiting probability vectors that apply to each
decoupled aggregate class of states of the original semi-Markov process. The
approximate technique is demonstrated on a couple of 4-state models that mimic
the class-to-class transition structure of typical fault-tolerant system
models.  The  results  show  that  accurate  approximation  is  achieved  for  these 
examples  after  a  short  transient  period.  In  addition,  the  ergodicity 
sufficient  condition  imposed  on  the  classes  of  the  original,  decoupled 
semi-Markov  process  by  Korolyuk’s  theorem  is  relaxed.  As  a  result, 
fault-tolerant  system  models  with  certain  types  of  non-ergodic  classes  can 
also  be  treated  by  the  approximate  technique. 

Keywords:

Semi-Markov  Process 
Fault-Tolerant  System 
Enlarged  Process 
Reliability  Evaluation 
Approximate  Solution 
Transient Analysis


1 INTRODUCTION

1.1 Background

A fault-tolerant system is a system designed with redundant capacity to
perform  its  function.  That  is,  it  can  do  its  job  using  more  than  one 
configuration  of  its  components,  e.g.  sensors,  actuators  and  information 
processing  components.  The  on-line  detection  and  isolation  of  failed 
components  and  the  subsequent  reconfiguration  of  the  system’s  operating 
architecture  is  performed  by  the  system’s  Redundancy  Management  (RM)  scheme. 

The  fault-tolerant  design  approach  enhances  system  reliability  and 
performance.  There  are  many  application  areas  where  ultra-high  system 
reliability  is  necessary  or  desirable.  One  such  area  is  the  control  of 
nuclear  power  plants  where  the  consequences  of  improper  control  system 
behavior  may  be  serious  indeed.  There  are  space  missions  for  which  the 
desired  operational  lifetime  of  the  spacecraft  is  many  years  during  which  time 
many  component  failures  are  probable.  The  air  traffic  control  system  and 
many  military  systems  are  also  subject  to  very  high  reliability  requirements. 
There  is  also  a  desire  for  increased  reliability  in  computerized  banking 
systems,  chemical  process  control  systems,  medical  monitoring  systems, 
transportation  systems,  and  many  more. 

Growing  attention  is  being  given  to  the  design  of  components  for  long  life, 
to quality control during manufacture, and to testing and maintenance policies
which enhance reliable system operation. Despite these efforts to improve
the reliability of individual components, the resulting system reliability is
still often inadequate for some reliability requirements. As a result, there
is increasing interest in fault-tolerant system designs which allow
components  to  fail  but  still  provide  a  means  for  the  system  to  continue  to 
function. 

The  growing  use  of  fault-tolerant  system  designs  has  in  turn  spurred 
interest  in  methods  for  assessing  the  reliability  and  performance  of  such 
systems.  The  traditional  methods  of  reliability  evaluation  are  based  on 
combinatorial analysis of combinations of component failures [1], and these
analyses seldom account for the probabilistic nature of the outcomes of the
on-line monitoring tests that are used to detect and identify failures and to
reconfigure the system. In addition, classical reliability analysis produces
as  its  sole  result  the  probability  that  the  system  will  maintain  its 
integrity  over  the  duration  of  its  operating  time.  No  information  is 
provided  on  the  performance  of  the  system  during  the  transient  period  of  the 
mission. 

Since classical reliability analysis fails to quantify fault-tolerant system
time behavior, other alternatives must be considered. Naturally, Monte Carlo
simulation is one option. However, for the systems we are interested in,
which are complex and have low component failure rates, a huge number of
simulations is required to generate statistically significant results, and
this is often prohibitively costly.

The use of Markov chain theory [2,3] has shown promise as a means for evaluating
the performance of fault-tolerant systems that employ Fault Detection and
Isolation (FDI) tests of the single sample variety, i.e., the information
that is used for FDI is gathered and discarded at each time sample. Methods
have been proposed to deal with the problems associated with large Markov chain
models: the aggregation/disaggregation technique of Takahashi, the decomposition
technique proposed by Courtois [5] and the aggregation technique of Bobbio [6]. In
addition to the problem of a large state space, these efforts also consider the
issue of stiffness, i.e. the simultaneous presence of transition rates of
different orders of magnitude in the model. The first two techniques solve for
the steady state probability vector; only the last addresses the
transient analysis of large stiff Markov chains. However, single sample FDI
tests generally have a relatively high likelihood of decision errors,
particularly in noisy signal environments. In such situations, the FDI tests
are usually based on several samples of the monitoring data at each time
sample, e.g. moving window tests and sequential tests [7,8,9]. Such tests are not
memoryless. Therefore, Markov chain analysis does not apply to systems
employing these types of tests.

Some effort [8,10] has been made to analyze such systems and it appears that
generalized Markovian (or semi-Markov [11,12]) modeling methods are applicable to

some  systems  of  this  type.  Semi-Markov  processes  are  very  similar  to  Markov 
chains,  but  have  an  extra  degree  of  freedom  that  makes  them  well-suited  for 
capturing  the  random  delay  behavior  of  RM  decisions  for  nonmemoryless  FDI 
tests.  However,  a  problem  with  this  reliability  evaluation  method  is  that 
the  large  number  of  states  in  the  model  causes  the  computation  of  results  to 
involve  excessive  amounts  of  computer  storage  and  computation  time.  The 
reason  for  this  is  that  standard  time-invariant  semi-Markov  theory  requires 
the  solution  of  a  matrix  convolution  integral  equation  to  find  the  interval 
transition  probability  matrix. 
For complex systems with a large number of states in their model (for
example, the model of a dual-redundant engine controller examined in [13] has 30 states
and flight control system models will have many more), it becomes intractable
to obtain a solution either analytically or numerically. More specifically,
consider a discrete time semi-Markov model with state probability vector $\underline{\pi}(k)$
at time step $k$. If $\underline{\pi}(0)$ is known, then $\underline{\pi}(k)$ can be expressed as

$$\underline{\pi}(k) = \underline{\pi}(0)\, \Phi(k) \tag{1}$$

where $\Phi(k)$ is recursively generated [12] by

$$\Phi(k) = {}^{>}W(k) + \sum_{m=0}^{k} \left[ P \,\square\, H(m) \right] \Phi(k-m), \qquad \Phi(0) = I \tag{2}$$

It can be seen that a convolution sum is involved. This implies that for a
system with $N$ states, approximately $2kN^2$ values must be stored in order to
compute $\Phi(k)$ and hence $\underline{\pi}(k)$. For $N \approx 20$ and $k = 100{,}000$, as might be the
case for a simple flight control system operating with RM updates at a rate of
50 Hz for 35 minutes, the storage required is approximately $80 \times 10^{6}$ values or
640 megabytes of storage for accurate single precision state probability
distribution calculations. The number of floating point multiplications
required for calculating $\Phi(100{,}000)$ is approximately $7 \times 10^{12}$. The same problem
arises for continuous time models. Several authors [14,15,16,17] have developed
algorithms for transient analysis of special semi-Markov processes. They allow
the aggregates of fast states to constitute a semi-Markov or a more general
stochastic process, and Trivedi et al. extend Stiffler's approach to allow for
slow transitions out of fast states. However, these attempts in the area of
reliability analysis were tailored to particular system structures and provide
no analytical approximate solution in terms of the parameters of the kernel elements of
the large and stiff semi-Markov process.
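
The cost of the convolution sum in Eq.(2) can be made concrete with a small sketch. The code below is an illustration only, not the report's implementation: the function names, the list-of-matrices layout for the holding time distributions, and the two-state test model are all assumptions, and holding times are assumed to be at least one step. It keeps every earlier $\Phi(k-m)$ in memory, which is exactly the storage burden described above.

```python
def storage_values(n_states, k_steps):
    """Approximate number of stored values quoted in the text: 2 k N^2."""
    return 2 * k_steps * n_states ** 2


def interval_transition(P, H, K):
    """Return [Phi(0), ..., Phi(K)] for a discrete-time semi-Markov model.

    P[i][j]    : imbedded transition probability from state i to state j.
    H[m][i][j] : probability that the holding time of an i -> j transition
                 equals m steps (assumed to be zero for m = 0).
    """
    n = len(P)
    # Elementwise product of P and H(m), one matrix per holding time m.
    core = [[[P[i][j] * H[m][i][j] for j in range(n)] for i in range(n)]
            for m in range(K + 1)]
    phi = [[[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]]
    left_by = [0.0] * n  # probability that state i has been left by step k
    for k in range(1, K + 1):
        for i in range(n):
            left_by[i] += sum(core[k][i])
        # Diagonal term: the process has not yet left its starting state.
        nxt = [[(1.0 - left_by[i]) if i == j else 0.0 for j in range(n)]
               for i in range(n)]
        # Convolution sum over the time m of the first transition.
        for m in range(1, k + 1):
            prev = phi[k - m]
            for i in range(n):
                for j in range(n):
                    nxt[i][j] += sum(core[m][i][l] * prev[l][j]
                                     for l in range(n))
        phi.append(nxt)
    return phi


# Hypothetical 2-state check: unit holding times reduce the model to a
# Markov chain, so Phi(k) should equal the k-th power of P.
P = [[0.9, 0.1], [0.2, 0.8]]
K = 3
H = [[[1.0 if m == 1 else 0.0 for _ in range(2)] for _ in range(2)]
     for m in range(K + 1)]
phi = interval_transition(P, H, K)
print(phi[3])
print(storage_values(20, 100_000))
```

The inner loop over `m` grows with `k`, so the work per step increases linearly with elapsed time, which is why long missions with fast update rates become expensive.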
This paper discusses an aggregation approximation technique explicitly intended
to aggregate states so that the resulting system has a small state space and the
transitions between aggregate states are approximated by a Markov chain. The
technique proposed in this paper can be applied to stiff semi-Markov models of
fault-tolerant systems. Generally speaking, each state of the model
represents a different system operating configuration in terms of the number of
working components, the components in use and the failure monitoring status. States
that share the same number of working components but differ in
RM configuration are aggregated together. The algorithm proceeds by first establishing the
Markovian behavior between aggregates of states of a semi-Markov process. The
resulting Markov chain is then analyzed by a standard analytical or numerical
technique. The approximate solution is completed by distributing the total
probability of each aggregate of states according to the stationary probability vector
within that aggregate of states.

The  rest  of  the  paper  is  organized  as  follows.  Section  2  derives  how  the 
aggregates  of  states  of  a  semi-Markov  process  can  be  approximated  by  a  Markov 
chain.  Section  3  relaxes  the  sufficient  condition  imposed  on  the  aggregates  of 
states  by  the  Theorem.  The  approximate  technique  is  then  demonstrated  with 
two  numerical  examples  in  Section  4. 

2  THEORY  OF  APPROXIMATE  AGGREGATE  TECHNIQUE 

2.1  Introduction 

Assume that the stiff semi-Markov models of the fault-tolerant systems of interest have
transition kernel elements of the following form (generally they do; see the
9-state model of a three-component redundant fault-tolerant system in the
literature):

$$P_{ij}^{\varepsilon}(t) = p_{ij}^{\varepsilon}\, F_{ij}(t), \qquad i, j \in E \tag{3}$$

$$p_{ij}^{\varepsilon} = \begin{cases} p_{ij}^{(k)} - \varepsilon\, q_{ij}^{(k)}, & i, j \in E_k \\ \varepsilon\, q_{ij}^{(k)}, & i \in E_k,\ j \notin E_k \end{cases} \tag{4}$$

where $\sum_{j \in E_k} p_{ij}^{(k)} = 1$ for $i \in E_k$, $1 \le k \le m$,

and where $\varepsilon$ is a small parameter associated with the failure rate of the components.
The model can then be aggregated into classes of states $E_k$, $1 \le k \le m$, such that there is no
transition between aggregates of states when $\varepsilon$ tends to zero. It was
shown [19] that the probability of state $i$ can be approximated as:

$$\pi_i(t) \approx \pi_i^{(k)}\, \pi_k^{\varepsilon}(t) \tag{5}$$

where $\pi_i^{(k)}$ is the stationary probability of state $i$ in class $E_k$ and $\pi_k^{\varepsilon}(t)$ is
the probability of the $k$th aggregate of states. So we have an approximate solution
if we can find the probability of the aggregates of states, since the stationary
probability in each class can be evaluated by standard semi-Markov theory.
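
The disaggregation step in Eq.(5) amounts to a single multiplication per state. The following sketch only illustrates that step; the dictionary layout and the numerical values are hypothetical and are not taken from any model in this paper.

```python
def disaggregate(class_probs, stationary_within):
    """Approximate state probabilities as in Eq.(5).

    class_probs[k]             : probability of aggregate (class) k at time t.
    stationary_within[k][state]: stationary probability of the state inside
                                 class k (each class sums to one).
    """
    state_probs = {}
    for k, within in stationary_within.items():
        for state, weight in within.items():
            state_probs[state] = weight * class_probs[k]
    return state_probs


# Hypothetical numbers: two classes, three states.
class_probs = {1: 0.9, 2: 0.1}
stationary_within = {1: {1: 0.25, 2: 0.75}, 2: {3: 1.0}}
state_probs = disaggregate(class_probs, stationary_within)
print(state_probs)
```

Because the within-class weights sum to one, the approximate state probabilities sum to the same total as the class probabilities.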


2.2  Korolyuk’s  Limit  Theorem  for  Semi-Markov  Processes 

The literature [20,21] describes sufficient conditions under which a perturbed
semi-Markov process can be approximated by a Markov chain. There are
essentially two conditions. First, the kernel of the semi-Markov process must
depend on a small positive parameter $\varepsilon$ in such a way that the state space $E$
of the semi-Markov process can be split into disjoint classes of states,
$E = \bigcup_{k=1}^{m} E_k$, such that the probability of departure from each class and
the sojourn time in a given state both tend to zero with $\varepsilon$. The total sojourn
time in each class is assumed to have a nondegenerate distribution in the
limit as $\varepsilon \to 0$. When $\varepsilon = 0$, the resulting process will be referred to as the
non-perturbed semi-Markov process, while the original process will be referred
to as the perturbed semi-Markov process. This condition can be expressed by
the following equations:

$$P_{ij}^{\varepsilon}(t) = p_{ij}^{\varepsilon}\, F_{ij}(t/\varepsilon), \qquad i, j \in E \tag{6}$$

$$p_{ij}^{\varepsilon} = \begin{cases} p_{ij}^{(k)} - \varepsilon\, q_{ij}^{(k)}, & i, j \in E_k \\ \varepsilon\, q_{ij}^{(k)}, & i \in E_k,\ j \notin E_k \end{cases} \tag{7}$$

where $\sum_{j \in E_k} p_{ij}^{(k)} = 1$ for $i \in E_k$, $1 \le k \le m$,

and where $p_{ij}^{\varepsilon}$ is the eventual transition probability of the original process from
state $i$ to state $j$ and $F_{ij}(t/\varepsilon)$ is the Cumulative Distribution Function (CDF)
of the holding time for transitions from state $i$ to state $j$.

Second, the Markov chains defined by the transition probabilities $p_{ij}^{(k)}$
($i, j \in E_k$, $1 \le k \le m$) must be ergodic with stationary probabilities $\pi_i^{(k)}$
($i \in E_k$, $1 \le k \le m$). When these conditions are satisfied by a perturbed semi-Markov
process, its behavior can be approximated by a Markov chain. More
specifically, if $T_{kr}^{(i)}$ is the sojourn time of the semi-Markov process in class $E_k$
when it begins from state $i \in E_k$ and moves to class $E_r$, then the Theorem shows
that the cumulative distribution function of $T_{kr}^{(i)}$ approaches an exponential
function as $\varepsilon$ becomes vanishingly small:

$$\lim_{\varepsilon \to 0} \Pr\left( T_{kr}^{(i)} < t \right) = p_{kr} \left( 1 - e^{-\Lambda_k t} \right) \tag{8}$$
As can be seen from the above equation, the dependence on $i$ disappears on the
right hand side of the equation. That is, each state in class $E_k$ has the same
asymptotic exponential holding time density function for transitions to class
$E_r$. Therefore, all the states in class $E_k$ can be merged together and the
aggregated model has the characteristics of a Markov chain.

In  a  sophisticated  fault-tolerant  system  there  is  Built-In  Test  Equipment 
(BITE)  included  in  the  RM  system  in  order  to  recover  a  working  component  that 
is  incorrectly  detected  as  failed  and  isolated,  so  a  component  that  was 
previously  isolated  as  failed  by  the  RM  can  be  brought  back  on  line.  For  this 
kind  of  system,  the  imbedded  Markov  chain  within  each  class  is  usually 
ergodic, so the ergodicity condition is satisfied. Moreover, comparing the
fault-tolerant system model described by Eq.(3)-(4) with the semi-Markov model
described by Eq.(6) shows that the system model of interest satisfies all the conditions
imposed by the Theorem except the condition defined by Eq.(6). Usually, this
condition is not satisfied by a fault-tolerant system model. The reason for
this is as follows. If $\varepsilon$ is small, i.e. the Mean Time To Failure (MTTF) of
the components is large, say hundreds of hours, then the holding time of the
transitions, particularly those within a class, is determined only by the
noise in the signals and the threshold set by the FDI test designer. So, as
the failure rate tends to zero, the RM decision delay will not be affected by
the failure rate. Therefore, the transition kernel of a fault-tolerant system
model will not take on the form implied by Eq.(6).

2.3 Derivation of $\Phi_{kr}(s)$ of a Time-Scaled Perturbed Semi-Markov Process

However, instead of viewing the semi-Markov process holding time from state $i$
to state $j$ as depending on the parameter $\varepsilon$, we could view the holding times as
being on a different time scale from the class-to-class transition holding times
of the aggregated model. The constant relating the two time scales is
the small parameter $\varepsilon$. So Eq.(6) represents the semi-Markov process on a
different time scale from that of the Markov chain of the aggregates of states,
and the time scale can be relaxed to any small parameter $\delta$,

$$P_{ij}^{\varepsilon}(t) = p_{ij}^{\varepsilon}\, F_{ij}(t/\delta) \tag{9}$$

With the properties of the semi-Markov process already stated in the last section,
we can proceed with the derivation of the parameters for the Markov chain of the
aggregated model.

Let  denote  the  sojourn  of  the  semi-Markov  process  in  state  i,  with  the 
CDF  Fj^t),  while  6^  be  binary  indicators  of  transition  from  state  i  to  the 
state  j,  so  that  E{5^j}=p^.  Then,  the  random  quantities  can  be  obtained 
by  using  the  expression  for  total  probability  : 


p<  i  '  >  -  I  P<  5‘r1'  6flj  +  ‘  '  > 


p<  «,!!.) 


Hence 


'  p<  ‘ >  dP?j<u> ptj<‘> 


P{  i  t  }  = 


Using  the  Laplace  transform, 

=  E  {  e~STkr  ) 

Ptj(s)  ■  £  e_St  dP?j<t> 

then,  Eq.(10)  becomes, 

♦k?)(s)  *  J^ki,(s>  pij(s> +  ^  pij(s> 
Combining the Laplace transform of Eq.(7) and Eq.(9):


Pij(S)  = 


-  eq^}  {1  -  5sa.  .  +  0(5)} 


5q^  +  0(5) 


^  €Ek 

1  €  V  J'  €  Ek 


Substituting  these  expressions  in  Eq.(14),  it  becomes. 


- 


(k)  .(j)fs)  =  _ 

pij  9kr  (S) 


«sa.jPW  ♦  «$>)  ^(s) 


*  e  >  ,W  +  0(e  6) 


Passing to the limit as $\varepsilon$ and $\delta \to 0$, the functions $\Phi_{kr}^{(i)}(s)$ are found to
satisfy the system of equations,

$$\Phi_{kr}^{(i)}(s) - \sum_{j \in E_k} p_{ij}^{(k)}\, \Phi_{kr}^{(j)}(s) = 0 \tag{17}$$

It follows from this and the assumption that the imbedded Markov chain defined
by the transition probabilities $p_{ij}^{(k)}$ ($i, j \in E_k$) is ergodic, that the
solution of system Eq.(17) [22] is independent of the superscript, i.e. for all
$i \in E_k$, $\Phi_{kr}^{(i)}(s) = \Phi_{kr}(s)$. Multiplying Eq.(16) by the stationary probabilities
$\pi_i^{(k)}$ and summing over all $i \in E_k$, then cancelling $\varepsilon$, the following is
obtained,


l  "m.  1  s  au  *  i{j  >  (s>= 

i€Ek  1  JeEk 

t  -00 


On passing to the limit as $\varepsilon \to 0$, noting that all the $\Phi_{kr}^{(i)}(s)$ have the limit
function $\Phi_{kr}(s)$, we obtain:

*kr(s)  =  l*  s  a*j  +  qS5>) 


(19) 


Table 1: State definitions and class decompositions for SCMS-I,II

State   State Definition                  Class
1       Component is working              1
2       Component has a false alarm       1
- - - - - - - - - - - - - - - - - - - - - - - -
3       System loss - component failed    2

5.1  SCMS  with  continuous  monitoring 

Table  1  enumerates  and  defines  the  states  of  a  semi-Markov  chain  reliability  model  of  the 
SCMS-I.  The  dashed  line  in  the  table  distinguishes  the  class  decomposition  of  the  model: 
class  1  contains  states  1  and  2,  class  2  contains  only  state  3. 

The semi-Markov transition diagram for the SCMS-I is presented in Figure 1. Two
aspects of this diagram should be noted. First, given that the chain has entered a state, the lines
directed out of that state represent transitions after the chain has remained in that state for
a period of time, namely, the holding time. Secondly, the dashed lines represent transitions
whose transition PMFs are proportional to $\varepsilon$. Thus, a dashed line represents the condition
that no such transition occurs when $\varepsilon = 0$. This is a convenient way of depicting the class
decomposition of a semi-Markov chain reliability model.
aggregated model can be approximated by a Markov chain. The parameters of
this process are given by Eq.(21)-(27).

Note that the approximate Markov chain evolves on a longer time scale, i.e.
$1/\delta$ times that of the original process. For instance, if $\delta = 1/3600$ and the
original semi-Markov model evolves in seconds, then the approximate enlarged
Markov process will evolve in hours.

One of the sufficient conditions in the derivation in this section for the
enlarged process is that all the disjoint classes must be ergodic when $\varepsilon = 0$.
This condition is usually not satisfied by all fault-tolerant system models.
However, relaxation of this condition is discussed in the next section.
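
The change of time scale can be written out as a pair of conversions. This is a minimal sketch with hypothetical helper names, assuming the scaled time is $t' = \delta t$ where $t$ is the original time and $\delta$ is the scale factor.

```python
def to_scaled_time(t_original, delta):
    """Time of the aggregated Markov chain, t' = delta * t."""
    return delta * t_original


def to_original_time(t_scaled, delta):
    """Inverse conversion, t = t' / delta."""
    return t_scaled / delta


delta = 1.0 / 3600.0  # original model in seconds, aggregated model in hours
print(to_scaled_time(7200.0, delta))   # two hours of mission time
print(to_original_time(1.0, delta))    # one scaled unit in seconds
```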

3  RELAXATION  OF  ERGODICITY  CONDITION 

The second sufficient condition stated in the last section for the approximate
Markov chain to be non-trivial is that the imbedded Markov chain of the
non-perturbed process within each class must be ergodic. However, this
condition can be relaxed and Korolyuk's Theorem can be modified as follows.

Theorem 1. If a semi-Markov process depends on a small parameter
$\varepsilon$ such that its state space can be partitioned according to
Eq.(7) and is time-scaled according to Eq.(6), and additionally if
the transition probability operators for the imbedded Markov
chain of the k-th class of the non-perturbed semi-Markov process
satisfy:

$$\lim_{n \to \infty} \frac{1}{n} \sum_{l=1}^{n} P_k^{\,l} = [\, \underline{e} \;\; \underline{e} \;\cdots\; \underline{e} \,]^T \tag{28}$$

where $\underline{e}$ is a column vector whose elements sum to 1,
then the aggregated semi-Markov process can be approximated by
the Markov chain defined by Eq.(20).


Proof. The proof follows an identical line of reasoning to the
proof in section 2 until the point where the functions $\Phi_{kr}^{(i)}(s)$
are shown to satisfy the system Eq.(17). The system of equations can
be rewritten in vector form, where $\underline{\Phi}_{kr}(s)$ is the column vector
with elements $\Phi_{kr}^{(i)}(s)$, $i \in E_k$:

$$\underline{\Phi}_{kr}(s) = P_k\, \underline{\Phi}_{kr}(s) \tag{29}$$

Premultiplying the above equation by the operator $P_k$ and using
Eq.(29) on the result gives:

$$\underline{\Phi}_{kr}(s) = P_k^{2}\, \underline{\Phi}_{kr}(s) \tag{30}$$

By successively premultiplying the system of equations,
replacing the left hand side by $\underline{\Phi}_{kr}(s)$, and averaging an infinite
number of these equations,

$$\underline{\Phi}_{kr}(s) = \left[ \lim_{n \to \infty} \frac{1}{n} \sum_{l=1}^{n} P_k^{\,l} \right] \underline{\Phi}_{kr}(s) \tag{31}$$

Since the operator $P_k$ defined by $p_{ij}^{(k)}$ satisfies Eq.(28), then, by
linear equation theory, the solution of the system of equations
in Eq.(31) is a vector with all its elements being the same, that
is, for all $i \in E_k$, $\Phi_{kr}^{(i)}(s) = \Phi_{kr}(s)$.

The remainder of the proof that the aggregated model is Markovian
and the derivation of the parameters of the approximate Markov chain
are exactly the same as the remainder of the proof in
section 2.
This extended Theorem is a relaxation of the ergodicity sufficient condition
stated earlier in Chapter 2 imposed on the semi-Markov process to be
approximated.

It is of interest to find conditions under which Eq.(3.1) is satisfied. Along
these lines, the following theorem is established:


Theorem 2. If the imbedded Markov chain which is defined by the
transition operator $P_k$ of the k-th class of the non-perturbed
semi-Markov process is (1) ergodic, or (2) non-ergodic with
one and only one eigenvalue of unity, then the operator $P_k$
satisfies Eq.(27).

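Proof. (1) By the ergodic theorem,

$$\lim_{n \to \infty} P_k^{\,n} = \Pi_k = [\, \underline{e} \;\cdots\; \underline{e} \,]^T \tag{32}$$

and,

$$\lim_{n \to \infty} \frac{1}{n} \sum_{l=1}^{n} P_k^{\,l} = \lim_{n \to \infty} \frac{1}{n} \left[ \sum_{l=1}^{r} P_k^{\,l} + \sum_{l=r+1}^{n} P_k^{\,l} \right] \tag{33}$$

where $r$ is finite but large such that

$$P_k^{\,l} \approx \Pi_k \quad \text{for } l > r \tag{34}$$

Therefore, Eq.(33) can be reduced to:

$$\lim_{n \to \infty} \frac{1}{n} \sum_{l=1}^{n} P_k^{\,l} = \lim_{n \to \infty} \frac{1}{n-r} \sum_{l=r+1}^{n} P_k^{\,l} \tag{35}$$

By Eq.(34) it follows:

$$\lim_{n \to \infty} \frac{1}{n} \sum_{l=1}^{n} P_k^{\,l} = \Pi_k = [\, \underline{e} \;\cdots\; \underline{e} \,]^T \tag{36}$$

which proves the Theorem for this case.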
(2) The operator $P_k$ can be put in Jordan form by the following
transformation:

$$P_k = T A_k T^{-1} \tag{37}$$

where $T$ is a square invertible matrix with columns made up of the
right eigenvectors (or generalized right eigenvectors) of the
operator $P_k$. By a proper ordering, $A_k$ has the form:

$$A_k = \begin{bmatrix} \lambda_1 & & & 0 \\ & \ddots & & \\ & & \lambda_p & \\ 0 & & & J \end{bmatrix} \tag{38}$$

where $\{ \lambda_1 \cdots \lambda_p \}$ are the unit magnitude eigenvalues and $J$ is a
Jordan form matrix containing all the eigenvalues of less than
unit magnitude on its main diagonal. (This form is known to exist
for a stochastic matrix $P_k$ because the unit magnitude eigenvalues
must have a full set of linearly independent eigenvectors.)
Therefore:

$$\lim_{n \to \infty} \frac{1}{n} \sum_{l=1}^{n} P_k^{\,l} = \lim_{n \to \infty} \frac{1}{n} \sum_{l=1}^{n} \left[ T A_k T^{-1} \right]^{l} = T \left[ \lim_{n \to \infty} \frac{1}{n} \sum_{l=1}^{n} A_k^{\,l} \right] T^{-1} \tag{39}$$
Since $A_k$ has one and only one eigenvalue of one:
n 

n^^co  ^  |  Aj|  s  diagonal  matrix  with  a  single  non-zero 

1=1 

element  of  unity  on  its  main  diagonal 

(40) 

because  =  0  and  =  0.  Because  P, 

is a stochastic matrix, the right eigenvector appearing in the
column of $T$ corresponding to the unit eigenvalue is a column
vector with all its elements equal to 1, i.e. $\underline{1}$. Therefore:


11 

T  limm  -  5  a£  1  =  [  0**«1«**0  ] 

n-»«°n£  k  J  L  ~  -  -  J 


Therefore: 


That  is. 


k*  TAk 

n  -*  « 

which  completes  the  proof. 


=  T  [  ] 

r  iT 

=  [  £  e***e  J 

(42) 

r  -iT 

=  [  §  e***e  J 

(43) 

As an illustration of the implication of the sufficient condition stated in
Theorem 2, valid and invalid examples of state transition structures are
shown in Figure 1. Note that the invalid example in Figure 1b has 2
eigenvalues of one because 2 trapping sets of states are present in a single
class.

As  a  result  of  the  Theorems  above,  the  state  probability  values  for 
fault-tolerant system models with non-ergodic classes that satisfy the
condition stated in Eq.(28) will be approximated well by the approximate
technique developed in this paper. Note that there may exist fault-tolerant
system models that can also be treated by the approximation technique even
when the conditions are violated, because the Theorem is a sufficient but not
necessary condition.
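
The condition in Eq.(28) can be checked numerically for a given class by averaging powers of its transition matrix and testing whether all rows of the average agree. The sketch below is only an illustration under stated assumptions: the helper names, the finite number of terms used in place of the limit, the tolerance, and the two small example matrices are hypothetical and are not the structures drawn in Figure 1.

```python
def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][l] * B[l][j] for l in range(n)) for j in range(n)]
            for i in range(n)]


def cesaro_average(P, n_terms):
    """Finite approximation of (1/n) * sum_{l=1}^{n} P^l."""
    n = len(P)
    power = [row[:] for row in P]
    total = [row[:] for row in P]
    for _ in range(n_terms - 1):
        power = mat_mul(power, P)
        for i in range(n):
            for j in range(n):
                total[i][j] += power[i][j]
    return [[x / n_terms for x in row] for row in total]


def rows_identical(M, tol=1e-2):
    """True when every row matches the first row within the tolerance."""
    return all(abs(x - y) <= tol for row in M[1:] for x, y in zip(row, M[0]))


# One trapping state: non-ergodic, but only one eigenvalue equal to one.
single_trap = [[0.5, 0.5],
               [0.0, 1.0]]
# Two trapping states in one class: two eigenvalues equal to one.
double_trap = [[0.5, 0.25, 0.25],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0]]

print(rows_identical(cesaro_average(single_trap, 2000)))
print(rows_identical(cesaro_average(double_trap, 2000)))
```

With a single trapping state the averaged rows converge to the same vector, while two trapping states leave rows that depend on the starting state, which is the behavior Theorem 2 excludes.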


Figure 1a: Valid non-ergodic classes (each class $E_i$ is drawn with transitions to and from classes $E_j$, $E_j \in E$ and $i \ne j$)

Figure 1b: Invalid non-ergodic class

4. TESTS OF APPROXIMATE TECHNIQUE WITH 4-STATE MODELS

In this section, we use 4-state models to demonstrate the use of the
approximate technique with the relaxed ergodicity condition developed in the
preceding sections. Four-state models are the dimensionally smallest models
that can be aggregated and can include class-to-class behavior. The
approximate technique has also been successfully applied to a 9-state model [19],
but the computation required to generate the numerically exact solution in such
a case becomes very costly.

Two cases will be considered here. In Case 1, there are two ergodic classes
where the second class is a trapping class. The next example, Case 2, consists
of two non-ergodic classes where class 2 is a trapping class.

4.1  Case  1 

The  schematic  state  transition  diagram  for  the  semi-Markov  process  is  shown 
in  Figure  2. 


Figure 2: State transition diagram for Case 1 (class 1 and class 2)

The process can be decomposed into two classes, namely class 1 and class 2,
when $\varepsilon = 0$. Class 1 comprises states 1 and 2 and class 2 comprises states 3 and
4. The transition from class 1 to class 2 is through the small eventual
transition probability $\varepsilon$ from states 1 and 2 to states 3 and 4. However, state
3 and state 4 cannot transit back to any of the states in class 1, hence
class 2 is a trapping class. The governing transition kernel matrix is given
by the following:


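$$P(t) = \begin{bmatrix}
0 & (1-6\varepsilon)\lambda_2 e^{-\lambda_2 t} & 2\varepsilon\lambda_1^2\, t e^{-\lambda_1 t} & 4\varepsilon\lambda_2^2\, t e^{-\lambda_2 t} \\
(0.3-7\varepsilon)\lambda_1 e^{-\lambda_1 t} & (0.7-2\varepsilon)\lambda_2 e^{-\lambda_2 t} & 6\varepsilon\lambda_1^2\, t e^{-\lambda_1 t} & 2\varepsilon\lambda_2^2\, t e^{-\lambda_2 t} \\
0 & 0 & 0.4\lambda_1^2\, t e^{-\lambda_1 t} & 0.6\lambda_2^2\, t e^{-\lambda_2 t} \\
0 & 0 & 0.5\lambda_1^2\, t e^{-\lambda_1 t} & 0.5\lambda_2^2\, t e^{-\lambda_2 t}
\end{bmatrix} \tag{44}$$

where $\lambda_1 = 0.2$, $\lambda_2 = 0.1$, $\varepsilon = 2.5 \times 10^{-6}$ (all units are in sec$^{-1}$).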

It is assumed that the initial condition on the state probability vector is

$$\underline{\pi}(0) = [\, 1 \;\; 0 \;\; 0 \;\; 0 \,] \tag{45}$$

One point about this model should be emphasized: the holding
time density functions for the transitions from states in class 1 to states in
class 2 and those within class 2 are 2nd order Erlang PDFs. These are
non-exponential holding time density functions. Therefore, this is a
semi-Markov model.
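
As a quick numerical check on the 2nd order Erlang holding time density $\lambda^2 t e^{-\lambda t}$, the sketch below integrates it with a simple trapezoidal rule to confirm that it is a proper density with mean $2/\lambda$, twice the mean of an exponential density with the same rate. The helper names, the integration limit and the step count are assumptions chosen for illustration.

```python
import math


def erlang2_pdf(t, lam):
    """2nd order Erlang density: lam^2 * t * exp(-lam * t)."""
    return lam * lam * t * math.exp(-lam * t)


def integrate(f, upper, steps=20000):
    """Trapezoidal rule on [0, upper]."""
    dt = upper / steps
    total = 0.5 * (f(0.0) + f(upper))
    for i in range(1, steps):
        total += f(i * dt)
    return total * dt


lam = 0.1            # rate in 1/sec, as for lambda_2 in Eq.(44)
upper = 60.0 / lam   # far enough into the tail for the remainder to be negligible
mass = integrate(lambda t: erlang2_pdf(t, lam), upper)
mean = integrate(lambda t: t * erlang2_pdf(t, lam), upper)
print(mass, mean)
```

The mean of $2/\lambda$ is the quantity that enters the mean waiting times of the states in class 2.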


4.1.1 Stationary probability distribution of the non-perturbed semi-Markov
process

By setting $\varepsilon = 0$ and dropping all the holding time density functions in the
transition kernel matrix, the transition probability matrix of the
non-perturbed Markov chain is found to be

$$P = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0.3 & 0.7 & 0 & 0 \\ 0 & 0 & 0.4 & 0.6 \\ 0 & 0 & 0.5 & 0.5 \end{bmatrix} \tag{46}$$

Note that each class of the non-perturbed process is ergodic for this case. By
raising the single step transition probability matrix successively to higher
powers, the stationary interval transition probability matrix is found.


The stationary probability vectors of the non-perturbed imbedded Markov chain
in class 1 and 2, respectively, are:

$$\underline{\pi}_{1} = [\, 0.2308 \;\; 0.7692 \,] \tag{47}$$

$$\underline{\pi}_{2} = [\, 0.4545 \;\; 0.5455 \,] \tag{48}$$
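
These stationary vectors can be reproduced by repeatedly applying the within-class transition matrix to any starting distribution, which has the same effect as raising the matrix to higher powers. The sketch below is an illustration with a hypothetical function name and a fixed iteration count; the two matrices are the within-class blocks obtained from the kernel in Eq.(44) with $\varepsilon = 0$.

```python
def stationary_vector(P, iterations=200):
    """Iterate v <- v P from a uniform start until it settles."""
    n = len(P)
    v = [1.0 / n] * n
    for _ in range(iterations):
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
    return v


class1 = [[0.0, 1.0],
          [0.3, 0.7]]
class2 = [[0.4, 0.6],
          [0.5, 0.5]]

pi_class1 = stationary_vector(class1)
pi_class2 = stationary_vector(class2)
print([round(x, 4) for x in pi_class1])
print([round(x, 4) for x in pi_class2])
```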

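The mean waiting times for the states in class 1 are

$$\tau_1 = p_{12} \frac{1}{\lambda_2} = 10 \tag{49}$$

$$\tau_2 = p_{21} \frac{1}{\lambda_1} + p_{22} \frac{1}{\lambda_2} = 8.5 \tag{50}$$

Therefore the mean waiting time for class 1, with $\pi_i$ the elements of the imbedded chain stationary vector in Eq.(47), is

$$\bar{\tau}^{(1)} = \sum_{i \in E_1} \pi_i \tau_i = 8.8462 \tag{51}$$

Hence, the stationary probabilities in class 1 are

$$\pi_1^{(1)} = \frac{\pi_1 \tau_1}{\bar{\tau}^{(1)}} = 0.2609 \tag{52}$$

$$\pi_2^{(1)} = \frac{\pi_2 \tau_2}{\bar{\tau}^{(1)}} = 0.7391 \tag{53}$$

or in vector form,

$$\underline{\pi}^{(1)} = [\, 0.2609 \;\; 0.7391 \,] \tag{54}$$

The mean waiting times for the states in class 2 are:

$$\tau_3 = p_{33} \frac{2}{\lambda_1} + p_{34} \frac{2}{\lambda_2} = 16 \tag{55}$$

$$\tau_4 = p_{43} \frac{2}{\lambda_1} + p_{44} \frac{2}{\lambda_2} = 15 \tag{56}$$

Therefore the mean waiting time of the process in class 2, with $\pi_i$ the elements of the vector in Eq.(48), is

$$\bar{\tau}^{(2)} = \sum_{i \in E_2} \pi_i \tau_i = 15.4545 \tag{57}$$

Hence, the stationary probabilities in class 2 are

$$\pi_3^{(2)} = \frac{\pi_3 \tau_3}{\bar{\tau}^{(2)}} = 0.4705 \tag{58}$$

$$\pi_4^{(2)} = \frac{\pi_4 \tau_4}{\bar{\tau}^{(2)}} = 0.5295 \tag{59}$$

or in vector form,

$$\underline{\pi}^{(2)} = [\, 0.4705 \;\; 0.5295 \,] \tag{60}$$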
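
The arithmetic of this subsection can be retraced in a few lines. The sketch below weights the imbedded chain stationary probabilities by the mean waiting times and renormalizes; the function name is hypothetical, and the inputs are the four-decimal values quoted in the text, so the last digit of a result may differ slightly from a calculation with unrounded inputs.

```python
def class_stationary(pi_imbedded, mean_waits):
    """Return (class mean waiting time, semi-Markov stationary probabilities)."""
    tau_class = sum(p * t for p, t in zip(pi_imbedded, mean_waits))
    probs = [p * t / tau_class for p, t in zip(pi_imbedded, mean_waits)]
    return tau_class, probs


lam1, lam2 = 0.2, 0.1

# Class 1: exponential holding times, mean 1/lambda.
tau1 = 1.0 / lam2
tau2 = 0.3 / lam1 + 0.7 / lam2
tau_c1, probs_c1 = class_stationary([0.2308, 0.7692], [tau1, tau2])

# Class 2: 2nd order Erlang holding times, mean 2/lambda.
tau3 = 0.4 * 2.0 / lam1 + 0.6 * 2.0 / lam2
tau4 = 0.5 * 2.0 / lam1 + 0.5 * 2.0 / lam2
tau_c2, probs_c2 = class_stationary([0.4545, 0.5455], [tau3, tau4])

print(tau_c1, probs_c1)
print(tau_c2, probs_c2)
```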


4.1.2  Approximate  Markov  chain 

In both of the cases in this section, the time scale factor $\delta$ is assumed to
equal $\varepsilon$ for evaluating the approximate Markov chain. The Laplace transform of
the kernel element for the transition from aggregated "state" 1 to aggregated
"state" 2 is given by Eq.(20) with parameters defined by Eq.(21)-(27).

From the transition kernel matrix in Eq.(44),

$$q_1^{(1)} = 6 \tag{61}$$

$$q_2^{(1)} = 7 + 2 = 9 \tag{62}$$
Substituting all the numerical quantities in Eq.(23):

$$\Lambda_1 = \frac{0.2308 \times 6 + 0.7692 \times 9}{0.2308 \times 10 + 0.7692 \times 8.5} = 0.9391 \tag{63}$$

Obviously, from the structure of the class-to-class transitions,

$$p_{12} = 1 \tag{64}$$

Therefore,

$$\Phi_{12}(s) = \frac{0.9391}{s + 0.9391} \tag{65}$$

or in the scaled time domain,

$$\phi_{12}(t') = 0.9391\, e^{-0.9391\, t'} \tag{66}$$

Since there are only two classes and class 2 cannot transit to class 1, the
approximate probability in class 2 is given in scaled time by

$$\pi_2^{\varepsilon}(t') = \int_0^{t'} \phi_{12}(\tau)\, d\tau = 1 - e^{-0.9391\, t'} \tag{67}$$

and the approximate probability in class 1 in scaled time is

$$\pi_1^{\varepsilon}(t') = 1 - \pi_2^{\varepsilon}(t') = e^{-0.9391\, t'} \tag{68}$$

Converting to the original time scale by using $\delta = 2.5 \times 10^{-6}$, this becomes

$$\pi_1^{\varepsilon}(t) = e^{-0.9391 \times 2.5 \times 10^{-6}\, t} \tag{69}$$

$$\pi_2^{\varepsilon}(t) = 1 - e^{-0.9391 \times 2.5 \times 10^{-6}\, t} \tag{70}$$
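
The aggregated rate and the resulting class 1 probability can be evaluated directly from the quantities already quoted for Case 1. The sketch below is a minimal illustration with hypothetical function names; it uses the rounded four-decimal inputs from the text and assumes the rate is the ratio of the weighted departure coefficients to the weighted mean waiting times, as in the numerical substitution above.

```python
import math


def aggregate_rate(pi_imbedded, q, mean_waits):
    """Weighted departure coefficients over weighted mean waiting times."""
    numerator = sum(p * x for p, x in zip(pi_imbedded, q))
    denominator = sum(p * t for p, t in zip(pi_imbedded, mean_waits))
    return numerator / denominator


pi_imbedded = [0.2308, 0.7692]   # imbedded chain stationary vector, class 1
q = [6.0, 9.0]                   # departure coefficients of states 1 and 2
mean_waits = [10.0, 8.5]         # mean waiting times of states 1 and 2 (sec)
delta = 2.5e-6                   # time scale factor, taken equal to epsilon

rate = aggregate_rate(pi_imbedded, q, mean_waits)


def class1_probability(t_seconds):
    """Approximate probability of remaining in class 1 at time t."""
    return math.exp(-rate * delta * t_seconds)


for t in (1, 10, 100, 1000, 10000):
    print(t, class1_probability(t))
```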


4.1.3  Exact  solution  of  the  original  semi-Markov  process 
In order to determine the accuracy of the approximate results above, it is
necessary to generate the exact solution to the original semi-Markov model.
The exact solution will be calculated analytically by using the Laplace
transform technique as in Eq.(11.5.6) of [12]. Although there are only four states
in the model to be solved, the manipulation requires the use of a powerful
symbolic manipulation program called MACSYMA.

Two  of  the  elements  of  the  interval  transition  probability  matrix  are  obtained 
by  this  procedure  as  follows: 


ir^t)  =  0.030115  [  e”2‘ 35x10  c  -  e-0-229"815  t  ] 


♦  0.230749  [  e-2-36xl0-6  t  _  .-0.22999815  t  ^ 

+  5. 384780* 10-7  t  e"0,1  *  +  0.538474  e~°A  C 
+  2.833533xl0“5  e"°‘2  *  (71) 

ir2(t)  =  0.939775  [  e"2355*10  1  -  e-°-229"815  t  ] 

4  0.538533  [  e-2  3Sx10'6  ‘  -  e-0-2299981S  t  j 

-  5.769362xl0"7  t  e"0,1  C  -  0.538483  e~°A  1 

-  5 . OOOOOOx 10_5  e~°2  *  (72) 

The total probability in class 1 is given by:

$$P_{E_1}(t) = \pi_1(t) + \pi_2(t) \tag{73}$$


4.1.4  Comparison  of  results 

The approximate and exact total probabilities in class 1 at different time
points are compared in Table 1.

Table 1
Comparison of approximate and exact probability in class 1

t (sec)   approximate class probability   exact class probability
1         0.99999                         1.00000
5         0.99999                         1.00000
10        0.99998                         0.99999
50        0.99988                         0.99990
100       0.99977                         0.99978
500       0.99883                         0.99884
1000      0.99765                         0.99767
5000      0.98833                         0.98834
10000     0.97680                         0.97697

The results indicate that errors in the approximation occur only at the fifth
decimal place up to t=10000 sec. with the maximum relative percentage error
occurring at t=1000 sec. with a value of only 0.0002%. This shows that the
class probability is well approximated by the approximate Markov chain.

After the class probability results have been compared, the transient
normalized probability vector within class 1, as shown in Table 2, is compared
with the stationary probability vector of the non-perturbed semi-Markov
process in that class, which is given in Eq.(54).
Table 2
Normalized probability distribution in class 1

t (sec)   state 1   state 2
1         0.9075    0.0925
5         0.6510    0.3490
10        0.4791    0.5209
40        0.2707    0.7293
100       0.2609    0.7391
200       0.2609    0.7391

It  is  easy  to  see  that  there  is  no  error  up  to  4  decimal  places  between  the 
stationary  normalized  probability  vector  and  stationary  probability  vector 
after  a  transient  period  of  100  seconds.  This  implies  that  the  original 
semi-Markov  process  solution  is  approximated  to  within  0.0002%  error  by  the 
approximate  solution  after  a  transient  period  of  100  seconds  at  the  beginning 
of  the  mission.  The  transient  period  is  about  12  times  the  minimum  mean 
waiting  time  among  the  states  of  the  non-perturbed  process  in  class  1  and 
0.025%  of  the  MTTF. 

4.2  Case  2 

For some fault-tolerant system semi-Markov models, there may be trapping
states among some classes of states. Under these circumstances, the imbedded
Markov chain of those classes will be non-ergodic. To demonstrate the
relaxed approximate technique, this example uses a model with two non-ergodic
classes, each consisting of two states. The state transition diagram is
shown in Figure 2 and the process is governed by the transition kernel
matrix in Eq.(74):


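$$
P(t) = \begin{bmatrix}
(0.5-\epsilon)\lambda_1 e^{-\lambda_1 t} & (0.5-5\epsilon)\lambda_2 e^{-\lambda_2 t} & 2\epsilon\lambda_3 e^{-\lambda_3 t} & 4\epsilon\lambda_4 e^{-\lambda_4 t} \\
0 & (1-9\epsilon)\lambda_2 e^{-\lambda_2 t} & 6\epsilon\lambda_3 e^{-\lambda_3 t} & 3\epsilon\lambda_4 e^{-\lambda_4 t} \\
0 & 0 & 0.4\lambda_3 e^{-\lambda_3 t} & 0.6\lambda_4 e^{-\lambda_4 t} \\
0 & 0 & 0 & \lambda_4 e^{-\lambda_4 t}
\end{bmatrix} \tag{74}
$$

where $\lambda_1 = 0.2$, $\lambda_2 = 0.1$, $\lambda_3 = 0.4$, $\lambda_4 = 0.5$,
$\epsilon = 2.5 \times 10^{-6}$ (all units are in sec$^{-1}$).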

Figure 3: State transition diagram for Case 2

4.2.1 Stationary probability distribution of the non-perturbed semi-Markov
process

By  decomposing  the  transition  kernel  matrix,  the  transition  probability 
matrix  of  the  non-perturbed  imbedded  Markov  chain  is  as  follows: 


$$
P = \begin{bmatrix}
0.5 & 0.5 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0.4 & 0.6 \\
0 & 0 & 0 & 1
\end{bmatrix} \tag{75}
$$


By raising the transition probability matrix to successively higher powers
until stationarity is established, the stationary probability vectors of the
non-perturbed imbedded Markov chain in classes 1 and 2 are found to be:

$$ \underline{\pi}^{(1)} = [\; 0 \quad 1 \;] \tag{76} $$

$$ \underline{\pi}^{(2)} = [\; 0 \quad 1 \;] \tag{77} $$

Hence, the stationary probability vectors of the non-perturbed semi-Markov
process are:


=  [  0  1  ] 

(78) 

(2)  =  [  0  1  ] 

(79) 
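
The power-raising step described above can be checked numerically. The following is a minimal sketch, not the authors' procedure: it assumes the two class blocks read off Eq.(75) and uses repeated squaring with a hypothetical tolerance `tol`; the helper names `mat_mul` and `stationary_rows` are illustrative only.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def stationary_rows(p, tol=1e-12, max_iter=200):
    """Square the matrix repeatedly (P, P^2, P^4, ...) until it stops changing."""
    current = [row[:] for row in p]
    size = len(p)
    for _ in range(max_iter):
        nxt = mat_mul(current, current)
        change = max(abs(nxt[i][j] - current[i][j])
                     for i in range(size) for j in range(size))
        current = nxt
        if change < tol:
            break
    return current


# Diagonal blocks of the non-perturbed imbedded chain (assumed from Eq.(75)).
class1 = [[0.5, 0.5], [0.0, 1.0]]
class2 = [[0.4, 0.6], [0.0, 1.0]]

limit1 = stationary_rows(class1)
limit2 = stationary_rows(class2)
print("class 1 limit row:", limit1[0])
print("class 2 limit row:", limit2[0])
```

Both limiting rows approach [0 1], in agreement with the vectors quoted above: all of the probability within each class collects in its trapping state.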

4.2.2  Approximate  Markov  chain 

The Laplace transform of the transition kernel for transition from aggregated
"state" 1 to "state" 2 is again given by Eq.(20). From the transition kernel
matrix in Eq.(74),


$$ q_2^{(1)} = 9 \tag{80} $$

$$ \tau_2^{(1)} = p_{22}\,\tau_{22} = 10 \tag{81} $$

Substituting the above quantities and Eq.(78) into Eq.(23) gives

$$ \frac{0 \times q_1^{(1)} + 9}{0 \times \tau_1^{(1)} + 10} = 0.9 \tag{82} $$
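
The ratio above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the weighted-ratio form is inferred from the numbers shown in the text, the state 2 values (9 and 10) are the ones quoted there, and the state 1 values are hypothetical readings of the first row of the kernel matrix that do not affect the result because state 1 has zero stationary weight. The function name `class_exit_rate` is illustrative.

```python
def class_exit_rate(stationary, exit_coeffs, mean_waits):
    """Stationary-weighted exit coefficient divided by stationary-weighted mean wait."""
    numerator = sum(p * q for p, q in zip(stationary, exit_coeffs))
    denominator = sum(p * t for p, t in zip(stationary, mean_waits))
    return numerator / denominator


stationary = [0.0, 1.0]    # stationary vector of class 1, Eq.(78)
exit_coeffs = [6.0, 9.0]   # state 1: 2 + 4 (assumed reading); state 2: 6 + 3
mean_waits = [7.5, 10.0]   # state 1: 0.5/0.2 + 0.5/0.1 (assumed); state 2: 1/0.1

rate = class_exit_rate(stationary, exit_coeffs, mean_waits)
print("scaled class 1 exit rate:", rate)
```

Because the stationary weight of state 1 is zero, only the trapping state of class 1 contributes, and the scaled exit rate is 9/10.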


From the structure of the model,

$$ p_{12} = 1 \tag{83} $$

So, the transition kernel element for transitions from aggregated "state" 1 to
2 in the new time scale is

$$ \phi_{12}(s) = \frac{0.9}{s + 0.9} \tag{84} $$

or, in the scaled time domain,

$$ \phi_{12}(t') = 0.9\, e^{-0.9 t'} \tag{85} $$

Because there are only two classes,

$$ \pi_1^{e}(t') = e^{-0.9 t'} \tag{86} $$

$$ \pi_2^{e}(t') = 1 - e^{-0.9 t'} \tag{87} $$

In  the  original  time  scale,  this  becomes 

$$ \pi_1^{e}(t) = e^{-0.9 \times 2.5 \times 10^{-6}\, t} \tag{88} $$

$$ \pi_2^{e}(t) = 1 - e^{-0.9 \times 2.5 \times 10^{-6}\, t} \tag{89} $$
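
The two class probabilities in the original time scale are simple to evaluate. The sketch below assumes the scaled rate 0.9 and the value 2.5e-6 for the small parameter as read from the text; the constant and function names are illustrative, and no claim is made that the printed output reproduces the tabulated values digit for digit.

```python
import math

SCALED_RATE = 0.9    # exit rate of aggregated "state" 1 in the scaled time
EPSILON = 2.5e-6     # small parameter of the example, in 1/sec (as read)


def class_probabilities(t_seconds):
    """Return (class 1, class 2) occupancy probabilities at time t in seconds."""
    exponent = -SCALED_RATE * EPSILON * t_seconds
    class1 = math.exp(exponent)
    class2 = -math.expm1(exponent)  # 1 - exp(exponent), accurate for small t
    return class1, class2


for t in (10, 1000, 10000, 1000000):
    c1, c2 = class_probabilities(t)
    print(f"t = {t:>8} s   class 1: {c1:.5f}   class 2: {c2:.5f}")
```

Using `expm1` for the second class avoids cancellation at short mission times, where the class 2 probability is of the order of 1e-5.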

This  completes  the  derivation  of  the  approximate  results  for  this  model. 
Clearly,  it  is  rather  trivial  to  expand  the  class  probabilities  by  the 
stationary  probabilities  in  each  class.  The  results  are 

$$ \pi_1(t) \approx 0 \tag{90} $$

$$ \pi_2(t) \approx \pi_1^{e}(t) \tag{91} $$

$$ \pi_3(t) \approx 0 \tag{92} $$

$$ \pi_4(t) \approx \pi_2^{e}(t) \tag{93} $$

4.2.3 Exact solution of the original semi-Markov process

Exact closed-form solutions for the state probabilities in class 1 were
evaluated with the help of MACSYMA, as in Case 1. The results for
this model are:


ir.(t)  = 


5  -9  9999xl0~^  t  5  e-l -^XXXX)x10  c 

1.0039x10  e  s-ayyyxiu  c  -  1.00338x10  e 

+  3.33334xl0"6  e4-0xl°  *  +  7.50000X10-6  e~5-0xl°  1  t 


-  5.6449x10 


-9 


(94) 


=  1.54588xl0-12  e-5-0001^10  c  [6.49386xl016  cosh(4.99991xlcf  12t) 

-  6.49373xl016  sinh(4.99991xl0-12  t)]  -  1.00382xl05  e-9-99999x10  t 

-  1.24998xl0"6  e‘4-0x10  C  -  5.62498xl0~7  e-50xl°  t  -1 .47474xl0~4 


ir2(t) 


(95) 

The total probability of occupying class 1 is then $\pi_1(t) + \pi_2(t)$, and the
normalized probabilities in this class are simply $\pi_1(t)$ and $\pi_2(t)$
normalized by their sum.


4.2.4  Comparison  of  results 

The  total  probability  of  occupying  class  1  obtained  from  the  approximate 
Markov  chain  and  from  the  analytical  solution  of  the  original  semi-Markov 
process  are  compared  in  Table  3.  The  exact  and  approximate  solutions  listed  in 
the  table  agree  to  four  decimal  places,  except  after  one  million  seconds  have 
elapsed  where  the  error  occurs  in  the  fourth  decimal  place. 

The exact normalized transient probability vector within class 1 is shown in
Table 4. It can be seen from these results that the stationary normalized
probability vector agrees with the stationary probability vector of the
non-perturbed semi-Markov process in class 1. The approximate state
probabilities converge to the exact solution within 0.0003 absolute error
after t=100 seconds.

This  example,  which  consists  of  two  non-ergodic  classes,  shows  that  the 
original  process  aggregated  transient  probability  vector  is  well  approximated 
by  the  approximate  Markov  chain.  Furthermore,  the  normalized  probability 
vector  in  class  1  converges  to  the  stationary  probability  vector  of  the 
non-perturbed  process  after  a  brief  transient  period.  Notice  that  this  example 
demonstrates  the  relaxation  of  the  ergodicity  condition  described  in  section 
3. 


Table 3

Comparison of approximate and exact probability in class 1

t/sec       approximate class 1        exact class 1
            probability $\pi_1^{e}(t)$     probability

10          0.99997                    0.99998
100         0.99976                    0.99978
200         0.99954                    0.99955
500         0.99886                    0.99888
1000        0.99774                    0.99775
5000        0.98880                    0.98881
10000       0.97773                    0.97775
1000000     0.10527                    0.10540

Table 4

Normalized probability distribution in class 1

t/sec     state 1     state 2

10        0.55183     0.44817
100       0.00027     0.99973
200       0.00000     1.00000
500       0.00000     1.00000

5  CONCLUSIONS  AND  CONTRIBUTIONS 

The  approximate  technique  presented  in  this  paper  can  be  used  to  quantify  the 
performance  of  those  fault-tolerant  systems  with  component  failure  rates  small 
relative  to  the  fault  detection  and  isolation  decision  rates.  This  paper  has 
shown  that  the  approximate  technique  can  be  a  practical  tool  to  simplify  the 
quantification  of  large  complex  fault-tolerant  system  performance  and  might 
also  be  an  efficient  tool  in  the  synthesis  of  such  system  designs. 

The  particular  contributions  of  this  paper  can  be  summarized  as  follows: 

1. Korolyuk's limit theorem was extended by generalizing the form that
the transition kernel elements may take. In particular, they may
depend through the holding time distribution on a time scale factor $\delta$
in addition to depending on the small parameter $\epsilon$ that divides the
state space of the system into classes.


2.  An  approximate  technique  based  on  this  extended  Theorem  was  then 
presented,  by  which  the  state  probability  vector  of  a  fault-tolerant 
system  semi-Markov  model  can  be  approximated  by  expanding  a  reduced 
order  Markov  chain  state  probabilities  by  the  stationary  probability 
vectors  of  the  non-perturbed  processes  within  the  disjoint  classes. 
The direct benefit of this approximate technique is a large reduction
of the computational cost of generating results for large models.
Therefore,  models  of  large  complex  fault-tolerant  systems  become 
tractable. 

3.  An  extended  theorem  with  the  relaxation  of  the  ergodicity  condition 
stated  in  Korolyuk’s  original  work  was  also  presented  and  proved  in 
section  xx.  As  a  result,  the  approximate  technique  can  be  applied  to 
a  wider  scope  of  fault-tolerant  system  models,  including  those  with 
certain  types  of  non-ergodic  classes.  One  of  the  examples  in  section 
4  demonstrated  the  use  of  this  relaxed  condition. 

Abstract

A property observed in high reliability fault tolerant control systems is the relatively
rare occurrence of component failures compared to the frequent occurrence of redundancy
management decision events. This property leads to a temporal decomposition
of the semi-Markov chain reliability model into two time scales: a slow time scale for
failure events and a fast time scale for FDI events. Conditions are described under which a
perturbed semi-Markov chain can be approximated by an enlarged Markov process, the
parameters of which are derived from the parameters of the semi-Markov chain.


1 Introduction

A typical fault-tolerant control system (FTCS) is composed of many highly reliable
redundant components including sensors, actuators, power supplies and computers. These
components are networked in a hierarchical architecture, and their use is governed by a
redundancy management (RM) policy. Failure detection and isolation (FDI) logic is
implemented to indicate to the RM system which components are no longer safely usable.

It has been demonstrated [1,2] that the reliability and availability of an FTCS can
be computed using a finite-state generalized Markov (that is, Markov or semi-Markov)
reliability model. These calculations are often difficult or impossible to accomplish by
classical combinatorial methods due to time-ordered event sequences that are a consequence
of the RM policy and FDI logic. If sequential tests are used to detect failures [3], then a
semi-Markov chain reliability model must be used to predict the system reliability.

Many methods exist for the simplified analysis of the steady state behavior of generalized
Markov chain models. However, generalized Markov chain models of FTCS invariably
contain one or more trapping states that represent system loss. Thus, the steady state
behavior is of no interest because the steady state condition will certainly be system loss.
It is the transient behavior of these models that is of interest.

A  generalized  Markov  chain  is  characterized  by  a  discrete  set  of  states  and  an  arbitrary 
distribution  of  the  holding  or  sojourn  time  for  each  transition.  The  semi-Markov  chain 
specializes  to  a  Markov  chain  when  the  holding  times  are  geometrically  distributed  and 
identically  distributed  for  all  transitions  exiting  a  particular  state. 

The result that must be routinely computed in analyzing the reliability model is the
interval transition probability, $\phi_{ij}(n)$, which is the probability that the model occupies state
j at time n given that it entered state i at the initial time. For FTCS, the states represent
a complete characterization of the condition of the system. Thus, if all of the $\phi_{ij}(n)$ that
correspond to system loss configurations for j can be computed for n corresponding to
the finite duration of the mission, then the probability of an unsuccessful mission can be
computed.

Once the interval transition probabilities have been determined for a particular time
n, the probability of occupying each state can be determined if the initial state occupancy
probabilities are known. Let $\underline{\pi}(n)$ be the state probability distribution at time n. If $\underline{\pi}(0)$
is known, then

$$ \underline{\pi}(n) = \underline{\pi}(0)\,\Phi(n) \tag{1} $$

In  the  context  of  the  FTCS,  the  first  state  is  routinely  chosen  to  represent  the  situation  where 
all  components  are  working.  Usually,  the  system  occupies  the  first  state  with  probability 
one  at  the  initial  time. 

The interval transition probabilities are generated by the semi-Markov chain recursion
formula [4]:

$$ \Phi(n) = {}^{>}W(n) + \sum_{m=0}^{n} G(m)\,\Phi(n-m); \qquad \text{IC: } \Phi(0) = I \tag{2} $$

Taking z-transforms of both sides of (2) and solving for $\Phi(z)$:

$$ \Phi(z) = [I - G(z)]^{-1}\,{}^{>}W(z) \tag{3} $$

The z-transform of the state occupancy probability distribution is

$$ \underline{\pi}(z) = \underline{\pi}(0)\,[I - G(z)]^{-1}\,{}^{>}W(z) \tag{4} $$

which follows directly from (2). The inverse matrix of $[I - G(z)]$ always exists for a
semi-Markov chain. The inverse transform of either $\Phi(z)$ or $\underline{\pi}(z)$ can be found using standard
partial fraction expansion techniques. However, for all but the simplest of situations,
transform methods are useless in a practical sense.
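
The recursion itself is short to write down. The following is a minimal sketch, assuming the convolution form of the recursion with a core matrix sequence that is zero at time zero, and a diagonal waiting-time term equal to the probability that the starting state has not yet been left. The function names and the two-state test chain are hypothetical and are not taken from the source.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def interval_transition_probabilities(core, n_max):
    """Return [Phi(0), ..., Phi(n_max)] for a core matrix sequence.

    core[m][i][j] is g_ij(m); core[0] is assumed to be all zeros, and
    g_ij(m) is taken as zero for m beyond the end of the list.
    """
    size = len(core[0])
    phi = [[[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]]
    left_by_now = [0.0] * size  # P(state i has been left by time n)
    for n in range(1, n_max + 1):
        if n < len(core):
            for i in range(size):
                left_by_now[i] += sum(core[n][i])
        # Diagonal term: the chain is still waiting in its starting state.
        total = [[1.0 - left_by_now[i] if i == j else 0.0 for j in range(size)]
                 for i in range(size)]
        # Convolution term: first transition at time m, then n - m more steps.
        for m in range(1, min(n, len(core) - 1) + 1):
            term = mat_mul(core[m], phi[n - m])
            total = [[total[i][j] + term[i][j] for j in range(size)]
                     for i in range(size)]
        phi.append(total)
    return phi


# Hypothetical two-state check: every holding time equals one step, so the
# semi-Markov chain reduces to a Markov chain and Phi(n) should equal P**n.
P = [[0.9, 0.1], [0.0, 1.0]]
ZERO = [[0.0, 0.0], [0.0, 0.0]]
phi = interval_transition_probabilities([ZERO, P], 3)
print(phi[3][0])
```

Every earlier matrix in the list is reused at each step, which is the source of the storage and operation counts that grow with the time index.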

In practice, the interval transition probability matrix is nearly always found by
performing the semi-Markov recursion numerically. For a model with N states, computation
of $\Phi(n)$ requires storage of $2nN^2$ values because both $\Phi(n)$ and $G(n)$ must be stored for
all times prior to and including time n. A reliability model for a typical inertial navigation
system might have twenty states, a sampling period of 200 ms, and a two hour mission time.
This would require storage of $2.88 \times 10^7$ single precision values and require 230 megabytes of
storage. Moreover, the number of floating point multiplications required to compute $\underline{\pi}(n)$
from $\underline{\pi}(0)$ is about $n^2N^2/2$, which is $2.59 \times 10^{11}$ for the example described above. Thus, the
computational burden and memory requirements are tremendous even for a simple system.
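
The figures quoted for the inertial navigation example can be reproduced with a short calculation. This sketch assumes the multiplication count is n^2 N^2 / 2, which is the reading consistent with the quoted 2.59e11 (the printed expression is hard to read), and it assumes 8 bytes per stored value, which is what the 230 megabyte figure implies; neither assumption is stated in the source.

```python
states = 20                  # N, number of model states
sampling_period_s = 0.2      # 200 ms
mission_s = 2 * 3600         # two hour mission

steps = round(mission_s / sampling_period_s)          # n, number of time steps
stored_values = 2 * steps * states ** 2               # Phi and G for every step
multiplications = steps ** 2 * states ** 2 / 2        # assumed operation count
megabytes = stored_values * 8 / 1e6                   # assumed 8 bytes per value

print("time steps:", steps)
print("stored values:", stored_values)
print("multiplications:", multiplications)
print("storage in megabytes:", megabytes)
```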

The  problem  to  be  addressed  in  this  paper  is  to  substantially  reduce  the  computational 
burden  while  preserving  the  accuracy  of  reliability  and  availability  calculations. 

One possible means for doing this is direct Monte Carlo techniques. If a sufficient
number of Monte Carlo simulations are made of system operations to account correctly for
all possible random events that bear on the reliability calculation, then any aspect of system
performance can be evaluated. To obtain meaningful results for high reliability systems
with events that occur with probabilities as low as $1 \times 10^{-9}$ (typical of the probability of a
component failure over a single time step), over one billion simulations must be performed.
This task is as formidable as evaluating the semi-Markov chain recursion for large values of
the time index. Consequently, reliability calculations via direct Monte Carlo methods also
have prohibitive computational costs.
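
A rough sense of the one-billion figure comes from asking how likely it is that a rare event appears at all in a given number of independent runs. The sketch below is an illustration only: it treats each run as a single Bernoulli trial with the stated probability, which is a simplifying assumption, and the function name is hypothetical.

```python
import math


def prob_at_least_one(p_event, runs):
    """P(event seen at least once in `runs` independent trials)."""
    # 1 - (1 - p)^runs, computed with log1p/expm1 to keep precision for tiny p.
    return -math.expm1(runs * math.log1p(-p_event))


p = 1e-9
for runs in (10 ** 8, 10 ** 9, 10 ** 10):
    print(f"{runs:>12} runs: {prob_at_least_one(p, runs):.4f}")
```

Even at one billion runs the event is observed at least once only about 63% of the time under this simplification, so estimating its probability with any accuracy needs considerably more.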

Lewis  suggested  in  [5,6]  that  a  modified  Monte  Carlo  approach  be  used  for  high  reliability 
systems.  Again,  failure  events  are  assumed  to  be  extremely  rare  relative  to  other  events 
that  occur  in  the  system.  Thus,  the  vast  majority  of  simulations  will  be  those  for  which 
no  failures  occur.  Lewis  assumes  that  all  events  have  exponentially  distributed  times  of 
occurrence  and  can  be  modeled  by  a  Markov  chain.  It  is  possible  to  sample  the  failure 
distributions  before  a  simulation  is  initiated  to  determine  if  any  failures  will  occur  during 
the  mission.  If  all  failures  occur  after  the  mission  has  been  completed  (which  is  usually 
the  case),  then  a  normal  simulation  results.  If  a  failure  occurs  during  the  mission,  then 
the  complete  simulation  must  be  performed  including  FDI  decisions,  decision  errors,  and 
repairs.  However,  this  approach  does  not  apply  to  semi-Markov  chains  because  FDI  events 
arising  from  a  sequential  FDI  test  are  not  exponentially  distributed.  In  these  cases,  a 
complete  simulation  must  always  be  run  and  no  benefits  are  derived  from  the  modified 
technique. 

Another approach that exploits the rare occurrence of failure events is suggested by
Trivedi in [7,8]. The model is based upon a time-scale decomposition of the system into
virtually disjoint fault-occurrence and fault-handling submodels. The fault-handling
submodels represent aggregated states and the failure occurrence submodels dictate the behavior
between these aggregated states. The reliability of the system predicted by the aggregated
model is then computed using Markov or Monte Carlo techniques. However, the only
fault-handling events that are accounted for are detections and missed detections following actual
faults. A common FDI event that cannot be treated by these hybrid models is the false
alarm, which occurs in the absence of a fault. Therefore, this approach is limited to systems
where false alarms cannot occur.

In  this  paper,  the  relatively  rare  occurrence  of  component  failures  relative  to  RM  decision 
events  will  be  exploited  in  the  development  of  an  approximate  method  for  evaluating  semi- 
Markov  chain  reliability  models  of  fault  tolerant  control  systems. 

2 A Limit Theorem for Semi-Markov Chains


Theorem 1 describes how a perturbed semi-Markov chain, which is dependent on a small
parameter $\epsilon$ in a certain way, can be described asymptotically by an enlarged Markov process
as $\epsilon \to 0$. This theorem is an extension of the results for discrete parameter semi-Markov
processes stated in [9].

The semi-Markov chain depends on a small parameter $\epsilon$ such that the entire state
space of the semi-Markov chain can be decomposed into disjoint classes of states where the
probabilities of departure from each class tend to zero with $\epsilon$. Also, the total sojourn in each
class is assumed to have a non-degenerate distribution in the limit as $\epsilon \to 0$. (When $\epsilon = 0$,
the chain will be referred to as the unperturbed semi-Markov chain while the $\epsilon$-dependent
chain will be referred to as the perturbed semi-Markov chain.)


Theorem 1 (Limit Theorem for Semi-Markov Chains) Let the set E of states of the
semi-Markov chain be expressible as a union of disjoint classes:

$$ E = \sum_{k=1}^{N'} E_k \tag{5} $$

Let $\tau_{kr}^{(i)}$ be the sojourn of the semi-Markov chain in class $E_k$ when it starts from state
$i \in E_k$ and moves to class $E_r$ where $r \neq k$. If the following two conditions hold for the
semi-Markov chain E:

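1. The elements of the core matrix sequence $\{g_{ij}^{\epsilon}(n) \mid i,j \in E\}$ specifying the
semi-Markov chain depend as follows on the small parameter $\epsilon$:

$$ g_{ij}^{\epsilon}(n) = p_{ij}^{\epsilon}\, h_{ij}(n/\epsilon) \tag{6} $$

where $h_{ij}(0) = 0$. The $p_{ij}^{\epsilon}$ can be expanded in a Taylor series about $\epsilon = 0$. Retaining terms
that are linear in $\epsilon$:

$$
p_{ij}^{\epsilon} =
\begin{cases}
p_{ij}^{(k)} - \epsilon\, q_{ij}^{(k)} + o(\epsilon) & \text{if } i,j \in E_k \\
\epsilon\, q_{ij}^{(k)} + o(\epsilon) & \text{if } i \in E_k \text{ and } j \notin E_k
\end{cases}
\tag{7}
$$

The embedded Markov chain for $\epsilon = 0$ obeys the usual Markov chain properties:

$$ \sum_{j \in E_k} p_{ij}^{(k)} = 1; \quad p_{ij}^{(k)} \in [0,1]; \quad \forall k \in M \tag{8} $$

2. The embedded Markov chains defined by the matrices $\{p_{ij}^{(k)} \mid i,j \in E_k, \forall k \in M\}$ are
ergodic with stationary distributions $\{\pi_i^{(k)} \mid i \in E_k, \forall k \in M\}$.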

Then: 


Jim  Pr  {r*  <  *}  =  7tr  { 1  -  exp  [-— ^]  } 

(9) 

where: 

7ir  = 

(10) 

A*  S 

*lkMk) 

(11) 

Here: 

t!*'1 

III 

M 

(12) 

)€£, 

-  EtS? 

(13) 

i€Ek 

III 

M 

jxa 

V*’, 

(14) 

j€E/t 

=  £*Mrt) 

n»  0 

(15) 

PROOF:  Let  eg denote  the  integer  valued  sojourn  of  the  semi-Markov  chain  in  state 
i  with  next  transition  to  state  j  with  the  holding  time  distribution  -/^-(n/e)  while  the  6^ 


are  the  transition  indicators  from  state  t  to  state  j.  The  probability  distribution  of  the 
random  quantities  rj£  can  be  expressed  in  terms  of  total  probability  as 

Pr  {r^  <  n}  =  £  Pr  {$£  =  1;  eft;  +  r£}  <  n}  +  £  Pr  {#,-  =  1;  eft,-  <  n}  (16) 

j€£*  jGEr 

Defining  the  interval  transition  CDF  as 

=  (IT) 

then 

H  II  -4;r(n  ~  m)  +  Z)  "3‘y(n)  (18) 

ieEk  m=0  jGEr 

Taking  ^-transforms  of  both  sides  yields: 

-4h‘)  =  E  s'>(*) s  *£'(»>  +  (  rfr  }  E  »«W-  ;»») 
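
The z-transforms $g_{ij}^{\epsilon}(z)$ must be evaluated to first order in $\epsilon$. From (6) and the
definition of the z-transform [10]:

$$ g_{ij}^{\epsilon}(z) = p_{ij}^{\epsilon} \sum_{n=0}^{\infty} h_{ij}(n/\epsilon)\, z^{-n} \tag{20} $$

Note that $p_{ij}^{\epsilon}$ has been moved in front of the summation sign because it does not depend
on time. Let $m = n/\epsilon$ and expand $z^{-\epsilon m}$ in a Taylor series about $\epsilon = 0$. Then:

$$ g_{ij}^{\epsilon}(z) = p_{ij}^{\epsilon} \sum_{m=0}^{\infty} \{1 - \epsilon\, m \log z\}\, h_{ij}(m) + o(\epsilon) \tag{21} $$

where $o(\epsilon)$ represents terms such that in the limit as $\epsilon \to 0$, the quantity $o(\epsilon)/\epsilon$ approaches
zero. Noting that:

$$ \sum_{n=0}^{\infty} h_{ij}(n) = 1 \tag{22} $$

$$ \sum_{n=0}^{\infty} n\, h_{ij}(n) = \tau_{ij} \tag{23} $$

and substituting $p_{ij}^{\epsilon}$ from (7) and combining terms of $o(\epsilon)$ yields:

$$
g_{ij}^{\epsilon}(z) =
\begin{cases}
p_{ij}^{(k)} \{1 - \epsilon\, \tau_{ij} \log z\} - \epsilon\, q_{ij}^{(k)} + o(\epsilon) & \text{if } i,j \in E_k \\
\epsilon\, q_{ij}^{(k)} + o(\epsilon) & \text{if } i \in E_k \text{ and } j \notin E_k
\end{cases}
\tag{24}
$$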

Incorporating  these  results  into  (19)  and  placing  all  terms  proportional  to  e  on  the  RHS: 

■  (z)  -  Pi?  -  (*)  =  *i?  (*)  { ?!,k)  +  P,-f  fH  log  z) 

i&Ek  }€Ek 

+  e{^}  E  1<?+0(e)  (25) 

Now, passing to the limit as $\epsilon \to 0$, the RHS vanishes and the ${}^{\le}\phi_{kr}^{(i)}(z)$ are found to satisfy
the system of equations below:

$$ {}^{\le}\phi_{kr}^{(i)}(z) - \sum_{j \in E_k} p_{ij}^{(k)}\, {}^{\le}\phi_{kr}^{(j)}(z) = 0 \tag{26} $$

Let $P_k = [p_{ij}^{(k)}]$ represent the embedded Markov chain operator in class $E_k$ of the
unperturbed semi-Markov chain E. The system of equations in (26) can be expressed as:
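
$$ {}^{\le}\underline{\phi}_{kr}(z)^{T} = P_k\; {}^{\le}\underline{\phi}_{kr}(z)^{T} \tag{27} $$

After successive premultiplication by $P_k$, and taking the limit as $n \to \infty$:

$$ {}^{\le}\underline{\phi}_{kr}(z)^{T} = \Big\{ \lim_{n \to \infty} P_k^{n} \Big\}\; {}^{\le}\underline{\phi}_{kr}(z)^{T} \tag{28} $$

Under Condition 2, the ergodic theorem for Markov chains [11] implies that:

$$
\lim_{n \to \infty} P_k^{n} = P_k^{\infty} =
\begin{bmatrix}
\underline{\pi}^{(k)} \\
\vdots \\
\underline{\pi}^{(k)}
\end{bmatrix}
\tag{29}
$$

so that the solution to (27) is independent of the superscript:

$$ {}^{\le}\phi_{kr}^{(i)}(z) = {}^{\le}\phi_{kr}(z) \quad \forall i \in E_k,\; \forall k \in M \tag{30} $$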
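
The ergodic limit used in this step can be observed numerically: high powers of an ergodic transition matrix approach a matrix whose rows are all equal to the stationary distribution. The sketch below uses a hypothetical three-state matrix with strictly positive entries (not taken from the source) and illustrative helper names.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_power(p, exponent):
    """Compute p**exponent by repeated multiplication."""
    size = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(exponent):
        result = mat_mul(result, p)
    return result


# Hypothetical ergodic class: all entries positive, rows sum to one.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.4, 0.5]]

limit = mat_power(P, 200)
# Largest difference between any two rows, column by column.
row_spread = max(max(limit[i][j] for i in range(3)) - min(limit[i][j] for i in range(3))
                 for j in range(3))
pi = limit[0]
pi_next = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]

print("row spread:", row_spread)
print("stationary row:", pi)
```

Since every row of the limit is the same vector, multiplying the limit by any column vector gives a vector with identical components, which is why the solution no longer depends on the starting state.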


Now, (25) is of the form $f(x) = g(x,\epsilon)$, that is, the LHS is not a function of $\epsilon$ and
is therefore constant with respect to $\epsilon$. However, as $\epsilon \to 0$, the RHS approaches zero so
that the LHS must be zero for all values of $\epsilon$. Canceling $\epsilon$ from the result, multiplying
by the stationary probabilities of the unperturbed semi-Markov chain in class k, $\pi_i^{(k)}$, and
summing over $i \in E_k$ yields:


£  'I*1  £  +^1°*'}  +  {-M  E +o(«)  (31) 

j€Ek  i&Ek  }&Er 


On passing again to the limit as $\epsilon \to 0$, noting that all of the ${}^{\le}\phi_{kr}^{(i)}(z)$ have the limit
function ${}^{\le}\phi_{kr}(z)$, and solving for ${}^{\le}\phi_{kr}(z)$, the z-transform of the class-to-class transition
PMF becomes:

**'w  =  7*'a‘i0!*1+a>  (32) 

The mapping from the z domain to the s domain (Laplace) is given by $s = (\log z)/T$. Dividing
top and bottom by the sampling period T, and applying the transformation concludes
the proof. □

In  summary,  Theorem  1  describes  the  conditions  under  which  a  perturbed  semi-Markov 
chain  can  be  approximated  by  an  enlarged  Markov  process  that  evolves  in  the  slow  time- 
scale,  and  also  states  how  the  parameters  of  the  Markov  process  are  determined  from  the 
parameters  of  the  semi-Markov  chain.  In  the  context  of  FTCS,  the  fast  time  scale  behavior 
within  a  class  would  represent  FDI  decision  and  RM  events  while  the  slower  class-to-class 


behavior would represent the occurrence of failures. The class-to-class interval transition
CDF ${}^{\le}\phi_{kr}(t)$ that results is a continuous time envelope of the behavior between the classes.
This  interpretation  is  intuitively  satisfying  since  failures  are  invariably  assumed  to  have 
exponentially  distributed  times  of  occurrence  over  continuous  time. 

However, two problems occur in the application of Theorem 1 to FTCS models: (1) the
embedded Markov chains for each class of the unperturbed model are rarely ergodic, and
(2) the holding time PMFs are usually functions of n, not $n/\epsilon$, that is, the holding times are
typically not on the order of the mean time to a component failure. The requirement that the
embedded Markov chains of the unperturbed classes be ergodic is important in producing
(26) and guarantees the existence of the stationary probabilities $\{\pi_i^{(k)} \mid i \in E_k, \forall k \in M\}$.
The ergodicity condition can be relaxed in much the same way as was done in [12] for
semi-Markov processes. This will be accomplished in Lemma 2 and Lemma 3. The second
problem can be mitigated by introducing time-scaling into Theorem 1, as will be done in
Theorem 4.

3  Relaxation  of  the  Ergodicity  Condition 

Lemma 2 discusses how the existence of the Cesàro limit of the embedded Markov chain
operator leads to a relaxation of the ergodicity condition.

Lemma 2 Consider a semi-Markov chain state space E that can be expressed as a sum of
disjoint classes according to (5) and (7). Let $P_k = [p_{ij}^{(k)}]$ represent the embedded Markov
chain operator for class $E_k$. The solution of (26) is independent of the superscript (and the
results of Theorem 1 hold), if the Cesàro limit exists:

$$ \lim_{n \to \infty} \frac{1}{n} \sum_{j=1}^{n} P_k^{j} \tag{33} $$

PROOF: The system of equations in (26) can be expressed in matrix form as is done
in (27). Successively premultiplying both sides by $P_k$ and averaging an infinite number of
these terms:

$$ {}^{\le}\underline{\phi}_{kr}(z)^{T} = \lim_{n \to \infty} \frac{1}{n} \sum_{j=1}^{n} P_k^{j}\; {}^{\le}\underline{\phi}_{kr}(z)^{T} \tag{34} $$

Because the operator $P_k$ satisfies the Cesàro limit from (33), the solution of (26) is
independent of the superscript. □

The relaxation due to Lemma 2 demonstrates that the ergodicity condition of Theorem
1 was sufficient, but not necessary. Thus, the conditions under which the Cesàro limit
exists should be determined in hopes of finding a necessary condition.

Lemma 3 Consider a semi-Markov chain state space E that can be expressed as a sum of
disjoint classes according to (5) and (7). Let $P_k = [p_{ij}^{(k)}]$ represent the embedded Markov
chain operator of the unperturbed chain for class $E_k$. If the embedded Markov chain
represented by the operator $P_k$ is: 1) ergodic, or 2) non-ergodic with one and only one unit
eigenvalue, then the Cesàro limit in (34) exists.

PROOF: The proof of this lemma is essentially similar to that in [12]. For details of this
proof, see [13]. □
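
The averaged limit used in Lemma 2 can be illustrated with a short calculation. This is a sketch only, with illustrative helper names. The first matrix is a hypothetical non-ergodic class with a trapping state (eigenvalues 0.5 and 1), for which the average of the powers settles to a matrix with identical rows. The second, a two-state periodic chain, is included purely to show that an average of powers can settle even when the powers themselves keep oscillating; it is not claimed to satisfy the conditions of Lemma 3.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def cesaro_average(p, n_terms):
    """Return (P + P^2 + ... + P^n_terms) / n_terms."""
    size = len(p)
    power = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    total = [[0.0] * size for _ in range(size)]
    for _ in range(n_terms):
        power = mat_mul(power, p)
        total = [[total[i][j] + power[i][j] for j in range(size)]
                 for i in range(size)]
    return [[value / n_terms for value in row] for row in total]


trapping = [[0.5, 0.5], [0.0, 1.0]]   # hypothetical class with a trapping state
periodic = [[0.0, 1.0], [1.0, 0.0]]   # powers alternate and never converge

avg_trapping = cesaro_average(trapping, 10000)
avg_periodic = cesaro_average(periodic, 10000)
print("trapping class average:", avg_trapping)
print("periodic chain average:", avg_periodic)
```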

4  Limit  Theorem  with  Time  Scaling 

In FTCS with small single step component failure probabilities, the holding time PMFs
associated with the core matrix sequence elements do not depend on $\epsilon$ but only on the
FDI decision delay. If a semi-Markov chain is observed in another time scale that is $1/\delta$
times that of the original time scale, then the PMF $h_{ij}(n)$ will be affected but the eventual
transition probabilities, $p_{ij}^{\epsilon}$, will remain the same because they characterize the transition
probability from state i to state j regardless of when the transition takes place. However,
the holding time PMFs in the new time scale are not obtained by simply changing the
argument of $h_{ij}(\cdot)$ from n to $n/\delta$. This is because the summation of $h_{ij}(n/\delta)$ for all
non-negative values of the time index would not be unity and so would not yield a proper holding
time function. The CDF ${}^{\le}h_{ij}(n)$ associated with the PMF $h_{ij}(n)$ must be determined and
the argument of the CDF replaced by $n/\delta$. The new PMF $h_{ij}^{\delta}(n)$ observed in the new
time scale would have most of its probability mass close to the origin. The statistics of the
process in the new time scale will depend on the small parameter $\delta$, the time scaling factor.
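
The difference between rescaling the PMF argument and rescaling the CDF argument can be seen in a small example. The sketch below assumes a geometric holding time and an integer value for the reciprocal of the time scaling factor, and it takes the new PMF to be the first difference of the rescaled CDF; that last step is an assumption, since the text does not spell out how the new PMF is formed. All names are hypothetical.

```python
def geometric_pmf(n, p):
    """P(holding time = n) for n = 1, 2, ...; zero otherwise."""
    return p * (1.0 - p) ** (n - 1) if n >= 1 else 0.0


def geometric_cdf(n, p):
    """P(holding time <= n) for integer n."""
    return 1.0 - (1.0 - p) ** n if n >= 1 else 0.0


def scaled_pmf(p, inverse_delta, n_max):
    """PMF in the new time scale: differences of the CDF evaluated at n / delta."""
    return [geometric_cdf(n * inverse_delta, p) - geometric_cdf((n - 1) * inverse_delta, p)
            for n in range(1, n_max + 1)]


P_STEP = 0.2          # hypothetical per-step transition probability
INVERSE_DELTA = 100   # 1 / delta, assumed to be an integer for simplicity

proper = scaled_pmf(P_STEP, INVERSE_DELTA, 5)
naive_total = sum(geometric_pmf(n * INVERSE_DELTA, P_STEP) for n in range(0, 6))

print("rescaled-CDF PMF:", proper)
print("sum of rescaled-CDF PMF:", sum(proper))
print("sum of naively rescaled PMF:", naive_total)
```

The naively rescaled PMF sums to almost nothing, while the PMF built from the rescaled CDF sums to one and places nearly all of its mass on the first time step of the new scale.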

Theorem 4 (Limit Theorem With Time Scaling) Let the set E of states of the
semi-Markov chain be expressible as a sum of disjoint classes as in (5). Let $\tau_{kr}^{(i)}$ be the sojourn
of the semi-Markov chain in class $E_k$ when it starts from state $i \in E_k$ and moves to class
$E_r$ for $r \neq k$. If the following two conditions hold for the semi-Markov chain E:

1. The elements of the core matrix sequence $\{g_{ij}^{\epsilon}(n) \mid i,j \in E\}$ specifying the
semi-Markov chain depend as follows on the small parameters $\delta$ and $\epsilon$:


(35) 


Here, ${}^{\le}h_{ij}(\cdot)$ is the transition CDF of the semi-Markov chain in the original time scale and
${}^{\le}h_{ij}(0) = 0$. The $p_{ij}^{\epsilon}$ can be expanded in a Taylor series about $\epsilon = 0$ as in (7). The embedded
Markov chain obeys the usual Markov chain properties described in (8).

2. The embedded Markov chains defined by the matrices $\{p_{ij}^{(k)} \mid i,j \in E_k, \forall k \in M\}$ are
ergodic or non-ergodic with one and only one unit eigenvalue, with the stationary probabilities
(in the Cesàro limit sense) $\{\pi_i^{(k)} \mid i \in E_k, \forall k \in M\}$.

Then: 


Um  Pr  {rk,  <  t'}  =  ^  {l  -  exp  [^]  } 


(36) 


where the parameters of the enlarged Markov process were defined in Theorem 1 and $\alpha = \delta/\epsilon$.


PROOF:  The  proof  of  this  theorem  is  essentially  identical  to  that  of  Theorem  1.  For 
details  of  this  proof,  see  [13].  □ 

It should be noted that an explicit analytical expression of the core matrix sequence,
$G^{\epsilon}(n)$, is not required to expand the eventual transition probabilities of the perturbed
semi-Markov chain, $p_{ij}^{\epsilon}$, in a Taylor series about $\epsilon = 0$. The eventual transition probabilities
may be evaluated numerically, which is what would be done in practice. This is fortunate
because the direct form of the core matrix is not always available [3]. In many cases, the
decision time PMFs are tabulated numerically and no functional form is available.

Also,  the  time  scale  decomposition  of  the  semi-Markov  chain  is  crucial  to  the  use  of  this 
technique.  A  simple  way  of  characterizing  each  class  is  as  follows:  the  first  class  contains 
states  for  which  no  failures  have  occurred,  the  second  class  contains  states  for  which  a  single 
failure  has  occurred,  the  third  class  contains  states  for  which  two  failures  have  occurred,  etc. 
These classes arise by setting $\epsilon = 0$ and observing which groups of states of the unperturbed
semi-Markov chain do not communicate.
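
The grouping step described above can be carried out mechanically once the unperturbed eventual transition matrix is known. The following is a minimal sketch, assuming that matrix is available as a list of rows; the function name `find_classes` and the three-state matrix are hypothetical illustrations, not part of the original analysis. States are placed in the same class whenever a nonzero transition links them in either direction, which is a simplification of the communication test.

```python
def find_classes(P, tol=0.0):
    """Group states that remain linked when epsilon = 0.

    P is the unperturbed eventual transition matrix as a list of rows.
    Two states share a class if a nonzero transition links them,
    directly or through intermediate states (either direction).
    """
    n = len(P)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the representative of i's group.
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(n):
            if i != j and (P[i][j] > tol or P[j][i] > tol):
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i + 1)  # 1-based state labels
    return sorted(groups.values())


# Hypothetical unperturbed matrix: states 1 and 2 exchange probability,
# while state 3 is reached only through transitions proportional to epsilon.
P0 = [[0.95, 0.05, 0.0],
      [0.95, 0.05, 0.0],
      [0.0, 0.0, 1.0]]
classes = find_classes(P0)
print(classes)
```

For this illustrative matrix the sketch groups states 1 and 2 into one class and leaves state 3 in a class of its own.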




Finally, estimates of the original semi-Markov chain state probabilities can be recovered
from the enlarged Markov process. The asymptotic behavior of the unperturbed semi-Markov
chain in each class is given by the stationary probabilities (or Cesàro limit probabilities)
for that class. The class-to-class behavior is determined by the enlarged process. The
approximate state probabilities in each class are:

$$\hat{\pi}_i^{(k)}(n) = \pi_i^{(k)}\,\hat{\pi}_k^{e}(n) \qquad (37)$$

where the approximate class probabilities $\hat{\pi}_k^{e}(n)$ of the enlarged process are found from its interval
transition probability matrix.
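
The recovery step can be sketched in a few lines. This is only an illustration, assuming the class probabilities and the within-class stationary probabilities are already known; the function name `state_probabilities` and all numbers are hypothetical.

```python
def state_probabilities(class_probs, stationary):
    """Approximate state probabilities: within-class stationary
    probability times the class occupancy probability of the
    enlarged process, class by class."""
    return [[pi_i * x_k for pi_i in stationary[k]]
            for k, x_k in enumerate(class_probs)]


# Hypothetical two-class example (illustrative numbers only).
stationary = [[0.95, 0.05], [1.0]]  # stationary probabilities within each class
class_probs = [0.99, 0.01]          # class occupancy probabilities at some time n
approx = state_probabilities(class_probs, stationary)
print(approx)
```

Because each set of stationary probabilities sums to one, the recovered state probabilities sum to one whenever the class probabilities do.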

5  Performance  Evaluation  of  the  SCMS 

Two  simple  semi-Markov  reliability  models  of  a  single  component  monitoring  system  (SCMS) 
will  be  developed.  The  SCMS  uses  a  sequential  FDI  test  to  monitor  the  status  (failed  or 
working)  of  a  single  component.  The  two  models  will  differ  in  monitoring  policy.  The  first 
example,  SCMS-I,  models  an  FDI  test  that  operates  continuously  over  the  entire  mission 
duration.  The  second  example,  SCMS-II,  models  an  FDI  test  that  is  discontinued  after  the 
first  failure  indication  (namely,  abbreviated  monitoring). 

In  this  section,  the  performance  of  the  SCMS  will  be  evaluated  through  application  of 
the  approximate  method  to  a  semi-Markov  model.  The  procedure  follows:  (1)  semi-Markov 
transition  diagrams  are  constructed  describing  all  of  the  random  events  that  can  take  place, 
(2)  the  direct  form  of  the  core  matrix  sequence  is  derived,  (3)  the  core  matrix  is  placed  in 
standard  form,  (4)  the  performance  is  evaluated  through  application  of  Theorem  4. 

In addition, z-transforms will be used to determine an analytical expression for the
state and class occupancy probabilities, $\pi(n)$ and $\pi^e(n)$ respectively. The results of the
z-transform analysis will be used to evaluate the accuracy of the approximate method. This
is possible here because the models are relatively simple. In more general cases, this would
not be practical.




Figure  1:  Semi-Markov  transition  diagram  for  SCMS-I 


5.1  SCMS  with  continuous  monitoring 

Table 1 enumerates and defines the states of a semi-Markov chain reliability model of the
SCMS-I. The dashed line in the table distinguishes the class decomposition of the model:
class 1 contains states 1 and 2, class 2 contains only state 3.

The semi-Markov transition diagram for the SCMS-I is presented in Figure 1. Two
aspects of this diagram should be noted. First, given that the chain has entered a state, the lines
directed out of that state represent transitions after the chain has remained in that state for
a period of time, namely, the holding time. Second, the dashed lines represent transitions
whose transition PMFs are proportional to $\epsilon$. Thus, a dashed line represents a transition
that cannot occur when $\epsilon = 0$. This is a convenient way of depicting the class
decomposition of a semi-Markov chain reliability model.

A complete statistical description of the sequential test used in the FDI process requires
knowledge of the conditional PMFs of the time to decision of the test. The following two
functions are required:

$f_N^0(n)$: PMF of time to a decision that no failure is present when no failure is present.

$f_D^0(n)$: PMF of time to a failure indication when no failure is present (false alarm).

In  these  PMFs,  the  fault  monitoring  event  at  time  n  must  be  conditioned  on  the  failure 
events  that  take  place  prior  to  and  including  time  n  -  1.  Thus,  it  is  assumed  that  there  is 
a  delay  of  at  least  a  single  time  step  between  when  a  failure  takes  place  and  when  it  can  be 
detected. 

Another necessary function is the probability that no decision is yet available at a given
time $n$. Because the probabilities of all possible test outcomes at time $n$ (nominal decision,
failure indication, and decision not yet available) sum to one, it can be specified in terms
of the decision time PMFs as:

$$q_0(n) = 1 - \sum_{k=1}^{n-1}\{f_N^0(k) + f_D^0(k)\}; \quad n \ge 1 \qquad (38)$$

Note that $q_0(n)$ is defined only for positive values of the time index $n$ and is defined to be
zero for $n = 0$. Thus, one of the necessary criteria for a permissible holding time PMF is
maintained: there is no probability mass at the initial time.

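The core matrix sequence, $G^{\epsilon}(n)$, for SCMS-I can be expressed in matrix form as:

$$G^{\epsilon}(n) = \begin{bmatrix} (1-\epsilon)^n f_N^0(n) & (1-\epsilon)^n f_D^0(n) & \epsilon(1-\epsilon)^{n-1} q_0(n) \\ (1-\epsilon)^n f_N^0(n) & (1-\epsilon)^n f_D^0(n) & \epsilon(1-\epsilon)^{n-1} q_0(n) \\ 0 & 0 & \delta(n-1) \end{bmatrix} \qquad (39)$$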

Any reasonable PMF may be used for the decision time PMFs. However, a closed form
solution for $\pi(n)$ is desired. A simple but realistic choice for the decision time PMFs is the
hypergeometric PMF [13]. This PMF is a good approximation to the holding time behavior
of many sequential tests, as demonstrated by Table 6.6 of [3]. Choosing an appropriate
eventual transition probability yields the hypergeometric decision time PMFs below:

$$f_N^0(n) = A_N^0\,(a^n - b^n); \qquad A_N^0 = (1 - P_{fa})\frac{(1-a)(1-b)}{(a-b)} \qquad (40)$$

$$f_D^0(n) = A_D^0\,(c^n - d^n); \qquad A_D^0 = P_{fa}\frac{(1-c)(1-d)}{(c-d)} \qquad (41)$$

where $0 < b < a < 1$ and $0 < d < c < 1$. The parameter $P_{fa}$ is the eventual false alarm
probability of the sequential test. The core matrix can now be expressed in terms of these
PMFs.
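
A short numerical check of these decision time PMFs is sketched below. It assumes that each PMF is a scaled difference of two geometric sequences, with the scale factors chosen so that the nominal-decision PMF sums to $1 - P_{fa}$ and the false alarm PMF sums to $P_{fa}$, and that the "decision not yet available" probability at time $n$ counts decisions made through time $n - 1$. The function name `decision_pmfs` is hypothetical; the parameter values are those used later in Section 5.2.

```python
def decision_pmfs(a, b, c, d, p_fa):
    """Return f_N(n), f_D(n) and q0(n) as Python functions of the time index n."""
    A_N = (1.0 - p_fa) * (1.0 - a) * (1.0 - b) / (a - b)
    A_D = p_fa * (1.0 - c) * (1.0 - d) / (c - d)

    def f_N(n):
        # PMF of time to a nominal decision (no probability mass at n = 0).
        return A_N * (a**n - b**n) if n >= 1 else 0.0

    def f_D(n):
        # PMF of time to a failure indication (false alarm).
        return A_D * (c**n - d**n) if n >= 1 else 0.0

    def q0(n):
        # Probability that no decision has been made before time n.
        if n < 1:
            return 0.0
        return 1.0 - sum(f_N(k) + f_D(k) for k in range(1, n))

    return f_N, f_D, q0


f_N, f_D, q0 = decision_pmfs(a=0.95, b=0.94, c=0.89, d=0.88, p_fa=0.05)
total_N = sum(f_N(n) for n in range(1, 5001))  # eventual nominal-decision probability
total_D = sum(f_D(n) for n in range(1, 5001))  # eventual false alarm probability
print(total_N, total_D, q0(1), q0(50))
```

The truncated sums recover the eventual transition probabilities $1 - P_{fa}$ and $P_{fa}$ to within the neglected tail, which is negligible after a few thousand time steps for these parameters.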

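A z-transform analysis of the semi-Markov recursion formula using the above core matrix
sequence yields the state occupancy probability vector $\pi(n)$, with $R = 1 - \epsilon$:

$$\pi_1(n) = (1 - P_{fa})R^n + P_{fa}\frac{c(1-d)}{(c-d)}(cR)^n - P_{fa}\frac{d(1-c)}{(c-d)}(dR)^n \qquad (42)$$

$$\pi_2(n) = P_{fa}R^n - P_{fa}\frac{c(1-d)}{(c-d)}(cR)^n + P_{fa}\frac{d(1-c)}{(c-d)}(dR)^n \qquad (43)$$

$$\pi_3(n) = 1 - R^n \qquad (44)$$

and the class occupancy probability vector $\pi^e(n)$:

$$\pi^e(n) = [\,(1-\epsilon)^n \quad 1 - (1-\epsilon)^n\,] \qquad (45)$$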


Availability of these analytical results permits comparisons to be made with the approximate
results that exploit the class decomposition to be described below. It should be emphasized
again that the existence of analytical solutions is rare, and occurs here only because the system
is very simple.
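
The closed-form state probabilities can be evaluated directly. The sketch below assumes the transient coefficients are $c(1-d)/(c-d)$ and $d(1-c)/(c-d)$, which follow from summing the false alarm PMF up to time $n$, and uses the parameter values of Section 5.2. The function name `scms1_state_probs` is hypothetical.

```python
def scms1_state_probs(n, eps, c, d, p_fa):
    """Closed-form SCMS-I state occupancy probabilities at time step n."""
    R = 1.0 - eps
    kc = c * (1.0 - d) / (c - d)
    kd = d * (1.0 - c) / (c - d)
    # Transient part that decays as the false alarm PMF accumulates.
    transient = p_fa * (kc * (c * R)**n - kd * (d * R)**n)
    pi1 = (1.0 - p_fa) * R**n + transient
    pi2 = p_fa * R**n - transient
    pi3 = 1.0 - R**n
    return pi1, pi2, pi3


p1, p2, p3 = scms1_state_probs(n=10, eps=0.00005, c=0.89, d=0.88, p_fa=0.05)
norm1 = p1 / (p1 + p2)  # normalized class 1 probabilities at n = 10
norm2 = p2 / (p1 + p2)
print(round(norm1, 4), round(norm2, 4))
```

With these inputs the three probabilities sum to one, the chain starts in state 1 at $n = 0$, and the normalized class 1 probabilities at the tenth time step agree with the values 0.9817 and 0.0183 quoted in Section 5.2.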

In order to derive the enlarged Markov process for this model, $G^{\epsilon}(n)$ must be placed
in standard form. For an in-class transition, the decomposition is obtained from the first
two terms of the Taylor series expansion of the eventual transition probability about $\epsilon = 0$.
In addition, the mean waiting times must be derived. For an out-of-class transition,
the decomposition is obtained by dividing the eventual transition probability by $\epsilon$ and then
taking the zeroth order term in the Taylor series expansion about $\epsilon = 0$.

Consider an in-class transition from state 1 to state 1. First, the eventual transition
probability is found:

$$p_{11}^{\epsilon} = A_N^0\left\{\frac{aR}{(1-aR)} - \frac{bR}{(1-bR)}\right\} \qquad (46)$$

The decomposition for the transition is:

$$p_{11}^{\epsilon} \approx (1 - P_{fa}) - \epsilon\,(1 - P_{fa})\frac{(1-ab)}{(1-a)(1-b)} \qquad (47)$$

To satisfy the requirements for a permissible holding time function, the holding time
function for this transition must be expressed as:

$$h_{11}^{\epsilon}(n) = \frac{(1-aR)(1-bR)}{(a-b)R}\left\{(aR)^n - (bR)^n\right\} \qquad (48)$$

From (15), the mean holding time can be found:

$$\bar{\tau}_{11}^{\epsilon} = \frac{(1-abR^2)}{(1-aR)(1-bR)} \qquad (49)$$

Thus, all of the parameters required to place this in-class transition PMF in standard form
have been derived.

A second type of core matrix element that must be placed in standard form is one
corresponding to an out-of-class transition, such as a transition from state 1 to state 3.
First, the eventual transition probability must be found:

$$p_{13}^{\epsilon} = \epsilon(1-P_{fa})\frac{(1-abR)}{(1-aR)(1-bR)} + \epsilon P_{fa}\frac{(1-cdR)}{(1-cR)(1-dR)} \qquad (50)$$

The sole parameter required for the approximation technique from this eventual transition
probability is found from:

$$\gamma_{13} = \lim_{\epsilon \to 0}\frac{1}{\epsilon}\,p_{13}^{\epsilon} = (1-P_{fa})\frac{(1-ab)}{(1-a)(1-b)} + P_{fa}\frac{(1-cd)}{(1-c)(1-d)} \qquad (51)$$

The eventual transition probabilities of each row of $G^{\epsilon}(n)$ sum to unity. Thus, this is a
proper semi-Markov chain [4].
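
The row-sum property can be checked numerically. The sketch below assumes the closed forms obtained by summing each first-row element of the core matrix over all $n$ with $R = 1 - \epsilon$; the expression used for the state 1 to state 2 probability is derived here by the same summation as for the state 1 to state 1 probability and is not stated explicitly in the text. The function name `eventual_probs_row1` is hypothetical.

```python
def eventual_probs_row1(eps, a, b, c, d, p_fa):
    """Eventual transition probabilities out of state 1 for SCMS-I."""
    R = 1.0 - eps
    den_ab = (1.0 - a * R) * (1.0 - b * R)
    den_cd = (1.0 - c * R) * (1.0 - d * R)
    p11 = (1.0 - p_fa) * R * (1.0 - a) * (1.0 - b) / den_ab
    p12 = p_fa * R * (1.0 - c) * (1.0 - d) / den_cd  # assumed by analogy with p11
    p13 = eps * ((1.0 - p_fa) * (1.0 - a * b * R) / den_ab
                 + p_fa * (1.0 - c * d * R) / den_cd)
    return p11, p12, p13


params = dict(a=0.95, b=0.94, c=0.89, d=0.88, p_fa=0.05)
p11, p12, p13 = eventual_probs_row1(eps=0.00005, **params)
row_sum = p11 + p12 + p13

# Zeroth order coefficient of p13 / eps, evaluated at a very small eps.
tiny = 1e-9
gamma_numeric = eventual_probs_row1(eps=tiny, **params)[2] / tiny
gamma_formula = (0.95 * (1 - 0.95 * 0.94) / (0.05 * 0.06)
                 + 0.05 * (1 - 0.89 * 0.88) / (0.11 * 0.12))
print(row_sum, gamma_numeric, gamma_formula)
```

The row sum equals one to within rounding for any $\epsilon$ between 0 and 1, and the ratio of the out-of-class probability to $\epsilon$ approaches the limiting coefficient as $\epsilon$ shrinks.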

The next step in the procedure is to determine the eventual transition probability matrix
of the unperturbed semi-Markov chain. This is found by setting $\epsilon = 0$ and ignoring all time
varying terms in the core matrix:

$$P = \begin{bmatrix} 1-P_{fa} & P_{fa} & 0 \\ 1-P_{fa} & P_{fa} & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad (52)$$

By raising $P$ to successively higher powers, the stationary interval transition probability
matrix is found to be identical to (52). The embedded stationary probability distribution
in partitioned form is thus:

$$\pi_{em} = [\,1-P_{fa} \quad P_{fa} \mid 1\,] \qquad (53)$$

With knowledge of this and of the mean holding times for transitions from state $i$ to $j$,
$\bar{\tau}_{ij}$, it is possible to determine the stationary probability distribution of the unperturbed
semi-Markov chain, $\pi$. This probability distribution is needed to approximate the state
probability distribution of the original perturbed semi-Markov chain.

From semi-Markov theory, the stationary probability distribution for each unperturbed
class $E_k$ is given by

$$\pi_i^{(k)} = \pi_{em,i}^{(k)}\,\bar{\tau}_i^{(k)} / \bar{\tau}^{(k)} \qquad (54)$$

where $\bar{\tau}^{(k)}$ is the mean waiting time of the chain in class $E_k$:

$$\bar{\tau}^{(k)} = \sum_{i \in E_k} \pi_{em,i}^{(k)}\,\bar{\tau}_i^{(k)} \qquad (55)$$

$\pi_{em,i}^{(k)}$ was determined above, and $\bar{\tau}_i^{(k)}$ is the mean holding time in state $i$:

$$\bar{\tau}_i^{(k)} = \sum_{j \in E_k} p_{ij}^{(k)}\,\bar{\tau}_{ij}^{(k)} \qquad (56)$$

where $\bar{\tau}_{ij}^{(k)}$ is determined from the limit of $\bar{\tau}_{ij}$, defined in (15), as $\epsilon \to 0$.

The stationary probability distribution of the unperturbed semi-Markov chain will now
be determined. The mean holding times of the unperturbed semi-Markov chain in the first
class are:

$$\bar{\tau}_{11}^{(1)} = \bar{\tau}_{21}^{(1)} = \frac{(1-ab)}{(1-a)(1-b)} \qquad (57)$$

$$\bar{\tau}_{12}^{(1)} = \bar{\tau}_{22}^{(1)} = \frac{(1-cd)}{(1-c)(1-d)} \qquad (58)$$

The mean holding time in class 1 starting from state 1 is thus

$$\bar{\tau}_1^{(1)} = (1-P_{fa})\frac{(1-ab)}{(1-a)(1-b)} + P_{fa}\frac{(1-cd)}{(1-c)(1-d)} \qquad (59)$$

Similarly, $\bar{\tau}_2^{(1)} = \bar{\tau}_1^{(1)}$. The mean waiting time of the semi-Markov chain in class 1 is:

$$\bar{\tau}^{(1)} = \sum_{i \in E_1} \pi_{em,i}^{(1)}\,\bar{\tau}_i^{(1)} = \bar{\tau}_1^{(1)} \qquad (60)$$

Hence, for this situation (but not in general): $\pi^{(1)} = \pi_{em}^{(1)}$.

The time scale factor $\delta$ is set equal to $\epsilon$ for convenience. It should be noted that $\delta$ must
be of the same order as $\epsilon$, but not necessarily equal to it.

All parameters required to describe the enlarged Markov process have now been stated.
The parameters of the approximate class-to-class interval transition CDF can be found as
described  in  Theorem  4:  731  =  1,  Ax  =  So,  the  class-to-class  interval  transition 

CDF  expressed  in  the  slow  time  scale  is: 

=  1  -  exp  }  (61) 


To return to the original time scale, let $t' = \delta t$, and recall that $\delta$ was chosen to be equal
to $\epsilon$ in this case. The rows of the interval transition probability matrix of the enlarged
process must sum to unity. Since the semi-Markov chain is always in state 1 at the initial
time, the enlarged process is always in class 1 at the initial time. Hence, approximate class
occupancy probabilities can be stated directly from the first row of the interval transition
probability matrix, since the class probability vector is the initial class probability vector
multiplied by that matrix:


m  =  [, 


1 ,  /  eAi‘\l 

[exp(-  — } 

1  }j 

(62) 


By expanding the approximate Markov process in terms of the stationary probabilities
of the unperturbed semi-Markov chain as in (37), approximate expressions for the state
occupancy probabilities of the original process can be stated as follows:


m  = 


/  fAl*  \  D  J 

f  <Ait\ 

l  T  J 

—  }J 

(63) 


The approximate expressions above will be compared to the analytical expressions
derived using z-transform techniques.


5.2  Discussion  of  Results  for  SCMS-I 

This section examines sources of error associated with the approximate technique for a
specific set of system parameters: $a = 0.95$, $b = 0.94$, $c = 0.89$, $d = 0.88$ and $P_{fa} = 0.05$. This set
of parameters implies a time to detection in the absence of a failure of 16 time steps (3.2
seconds), and a time to a nominal decision in the absence of a failure of 36 time steps (7.2
seconds) for a sample period of 200 milliseconds.
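
These decision times can be reproduced from the mean of a normalized PMF proportional to $x^n - y^n$, which is $(1 - xy)/((1-x)(1-y))$. The sketch below is an illustrative check only; the function and variable names are hypothetical.

```python
def mean_holding_time(x, y):
    """Mean (in time steps) of the normalized PMF proportional to x**n - y**n, n >= 1."""
    return (1.0 - x * y) / ((1.0 - x) * (1.0 - y))


SAMPLE_PERIOD_S = 0.2  # 200 millisecond sample period

t_nominal = mean_holding_time(0.95, 0.94)  # nominal decision, parameters a and b
t_detect = mean_holding_time(0.89, 0.88)   # false alarm indication, parameters c and d
print(round(t_nominal), round(t_nominal) * SAMPLE_PERIOD_S)
print(round(t_detect), round(t_detect) * SAMPLE_PERIOD_S)
```

Rounded to whole time steps, the two means are 36 and 16 steps, which correspond to the quoted 7.2 and 3.2 seconds at the 200 millisecond sample period.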

The relative error (in percent), $\Delta_i = |\hat{\pi}_i(n) - \pi_i(n)| / \pi_i(n)$, will be used to compare the
approximate and the analytical state occupancy probabilities.

The approximate state probability time histories, $\hat{\pi}(n)$, are compared to those obtained
analytically, $\pi(n)$, in Figure 2 for each of the three states. These results are for $\epsilon = 0.00005$,
implying an MTBF of 20,000 time steps (4000 seconds or just over an hour). In this figure,
the state probabilities are propagated for a period of one component MTBF. Time is
normalized by the MTBF.

The largest error occurs early, especially in the first class. This is due to the fact that
the normalized state probabilities in class 1 have not yet converged to the class 1 stationary
probabilities of the unperturbed semi-Markov chain. For example, at the tenth time step
the normalized probabilities in class 1 are

$$\pi^{(1)}(10) = [0.9817,\ 0.0183]. \qquad (64)$$

These differ substantially from the class 1 stationary probabilities of the unperturbed
semi-Markov chain:

$$\pi^{(1)} = [0.9500,\ 0.0500]. \qquad (65)$$

The approximate method accurately estimates the state probabilities when the normalized
probabilities have converged to the stationary probabilities in each class. This
occurs as early as time step 200, and the relative errors for states 1 and 2 have dropped to
Aj  =  Aj  =  8.62  x  10-*%,  which  indicates  that  the  estimate  is  closely  tracking  the  exact 
solution. Until time step 200, use of the approximate method is not valid, resulting in large
relative errors in the state probabilities.

Figure 2: State probability time histories for SCMS-I. (a) Analytical solutions, (b) Relative error

Figure 3: Sensitivity to $\epsilon$ for SCMS-I. The relative error is plotted versus the single-step
probability, $\epsilon$, for mission times of one MTBF, 0.5 x MTBF, and 0.25 x MTBF.

Another source of error is the non-zero value of $\epsilon$, since Theorem 4 holds
in the limit as $\epsilon \to 0$. The $\epsilon$ chosen in Figure 2 was "small enough" because the
state probabilities were estimated adequately. Figure 3 examines the class 2 (or state 3)
probability at 100%, 50% and 25% of an MTBF for a range of values of $\epsilon$. The relative
error decreases markedly with decreasing $\epsilon$ for all three choices of mission time. For large
$\epsilon$ ($\epsilon > 0.01$), the "slow" time scale represented by failure events and the "fast" time scale
represented by fault monitoring events are nearly indistinguishable from each other, resulting
in poor estimates of the state probabilities. In contrast, for small $\epsilon$ ($\epsilon < 0.001$), the two time
scales are distinct. For $\epsilon = 0.00005$, the time to a decision is about 36 seconds and the
MTBF is 4000 seconds, or, the "slow" time scale is approximately 100 times slower than
the fast time scale. Therefore, to obtain accurate estimates of the state probabilities, it
is imperative that the fast and slow time scales be distinctly separated in terms of their
mean holding times. A possible rule of thumb is suggested by these results for determining
whether the time scales are distinct. That is, compute the holding time of the slowest FDI
event. For the approximation to be valid, the MTBF of the fastest failure should be at
least 100 times longer than this calculated FDI holding time.
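
The rule of thumb can be written as a simple check. The sketch below is an illustration only: it assumes both quantities are expressed in time steps, takes the MTBF as the reciprocal of the single-step failure probability, and treats the factor of 100 as an adjustable parameter. The function name `time_scales_distinct` and the example inputs are hypothetical.

```python
def time_scales_distinct(eps_fastest, slowest_fdi_steps, min_ratio=100.0):
    """Compare the MTBF of the fastest failure with the slowest FDI holding time.

    eps_fastest: single-step failure probability of the fastest failure.
    slowest_fdi_steps: mean holding time of the slowest FDI event, in time steps.
    Returns the ratio of the two time scales and whether it meets min_ratio.
    """
    mtbf_steps = 1.0 / eps_fastest
    ratio = mtbf_steps / slowest_fdi_steps
    return ratio, ratio >= min_ratio


# Illustrative inputs: a 36-step FDI holding time with two failure probabilities.
ratio_small, ok_small = time_scales_distinct(eps_fastest=0.0001, slowest_fdi_steps=36.0)
ratio_large, ok_large = time_scales_distinct(eps_fastest=0.01, slowest_fdi_steps=36.0)
print(ok_small, ok_large)
```

In this example the smaller failure probability passes the check and the larger one fails it, in line with the trend described for Figure 3.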

The analytical and approximate solutions of the class 2 probability can also be compared
by expanding each in a Taylor series about $\epsilon = 0$. If the two are the same to first order in
$\epsilon$, then the estimate is a first order perturbation solution. If they differ, this would suggest
that an alternative estimate could be derived. Expanding $\pi_2^e(n)$ and $\hat{\pi}_2^e(n)$ in Taylor series
about $\epsilon = 0$:

$$\pi_2^e(n) = n\epsilon - \tfrac{1}{2}(n^2 - n)\epsilon^2 + o(\epsilon^2) \qquad (66)$$

$$\hat{\pi}_2^e(n) = n\epsilon - \tfrac{1}{2}\left\{n^2 - 2n\,\frac{d}{d\epsilon}\Lambda_1(\epsilon)\Big|_{\epsilon=0}\right\}\epsilon^2 + o(\epsilon^2) \qquad (67)$$

To first order in $\epsilon$:

$$\pi_2^e(n) = \hat{\pi}_2^e(n) = n\epsilon + o(\epsilon) \qquad (68)$$

Figure 4: Semi-Markov transition diagram for SCMS-II

So, the approximation developed in Theorem 4 produces a first order perturbation
solution in $\epsilon$ for this model. Therefore, the error between the analytical and approximate
class 2 probabilities begins with the order $\epsilon^2$ terms. Note that the dominant second order
term ($n^2\epsilon^2$) is also the same. It can be shown [13] that the error is due to a difference in
a second order term with a small coefficient, namely a term that is proportional to elapsed
time. Although this observation is strongly model dependent, it may also be true for other
models.

5.3  The  SCMS  with  abbreviated  monitoring 

A  second  method  of  fault  monitoring  is  to  deploy  a  sequential  test  that  monitors  the  status 
of  a  component  until  a  failure  is  indicated,  at  which  point  the  sequential  test  is  discontinued. 
An  SCMS  of  this  type  will  be  denoted  by  SCMS-II. 

The states for the semi-Markov model of the SCMS-II are enumerated in Table 1. The
semi-Markov transition diagram of the SCMS-II is depicted in Figure 4. The class
decomposition of the SCMS-II is similar to that of the SCMS-I. However, in this case, the embedded Markov
chain in class 1 is non-ergodic.

The direct form of the core matrix sequence can be developed in the same manner as for
the SCMS-I. A notable difference is in the transition probabilities out of state 2. Because
the fault monitoring test is discontinued upon a failure indication, only failure events cause
such transitions. A reset of state 2 occurs when no failure occurs. A transition from state 2
to state 3 occurs only if a failure takes place. Assuming geometrically distributed failures,
the core matrix can be stated:

$$G^{\epsilon}(n) = \begin{bmatrix} (1-\epsilon)^n f_N^0(n) & (1-\epsilon)^n f_D^0(n) & \epsilon(1-\epsilon)^{n-1} q_0(n) \\ 0 & (1-\epsilon)\,\delta(n-1) & \epsilon\,\delta(n-1) \\ 0 & 0 & \delta(n-1) \end{bmatrix} \qquad (69)$$

As for the SCMS-I, $\pi(n)$ can be found using z-transforms. The state probability time
histories could not be obtained, however, because the partial fraction expansions could only
be done numerically. These results are described fully in Appendix B of [13]. However, the
class probabilities were found and are stated below:

$$\pi^e(n) = [\,(1-\epsilon)^n \quad 1 - (1-\epsilon)^n\,] \qquad (70)$$

Again, these analytical expressions for $\pi(n)$ and $\pi^e(n)$ will be compared to the approximate
results derived using the approximate technique in the next section.

To generate the approximate solutions, the core matrix must be placed in standard form.
However, all of the required quantities are known based on the manipulations performed for
the SCMS-I. The eventual transition probability matrix of the unperturbed semi-Markov
chain is obtained by setting $\epsilon = 0$ and ignoring the holding time PMFs:

$$P = \begin{bmatrix} 1-P_{fa} & P_{fa} & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad (71)$$

By raising this matrix to successively higher powers, the stationary interval transition
probability matrix can be found, and the embedded stationary probability distribution in
partitioned form is:

$$\pi_{em} = [\,0 \quad 1 \mid 1\,] \qquad (72)$$
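
The effect of raising the unperturbed matrix to higher powers can be seen numerically. The sketch below assumes state 3 is absorbing when $\epsilon = 0$ (as the last row of the core matrix implies) and uses the false alarm probability 0.05 from Section 5.2; the helper names `mat_mul` and `mat_power` are hypothetical.

```python
def mat_mul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def mat_power(P, n):
    """n-th power of a square matrix by repeated multiplication."""
    size = len(P)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result


p_fa = 0.05
P = [[1.0 - p_fa, p_fa, 0.0],  # state 1: nominal decision or false alarm
     [0.0, 1.0, 0.0],          # state 2: test discontinued, chain stays put
     [0.0, 0.0, 1.0]]          # state 3: failed, absorbing
limit = mat_power(P, 500)
for row in limit:
    print([round(x, 6) for x in row])
```

After many steps the first row places essentially all of its probability on state 2, so within class 1 the embedded stationary distribution assigns zero to state 1 and one to state 2.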

Because of the model structure, it is clear that the stationary probabilities for each class of
the unperturbed semi-Markov chain are: $\pi = \pi_{em}$. For this analysis, the time scale factor
$\delta$ is again set equal to $\epsilon$. Finally, $\gamma_{12} = 1$ and $\Lambda_1 = 1$, so that the approximate expressions
for the class probabilities can be found:

$$\hat{\pi}^e(t) = [\,\exp(-\epsilon t) \quad 1 - \exp(-\epsilon t)\,] \qquad (73)$$

By expanding the enlarged Markov process in terms of the stationary probabilities of the
unperturbed semi-Markov chain, approximate expressions for the state occupancy
probabilities of the original process can be stated:

$$\hat{\pi}(t) = [\,0 \quad \exp(-\epsilon t) \quad 1 - \exp(-\epsilon t)\,] \qquad (74)$$

5.4 Discussion of Results for SCMS-II

The approximate state probability time histories, $\hat{\pi}(n)$, are compared to those obtained
analytically, $\pi(n)$, in Figure 5 for each of the three states. These results are for the same
parameter set as SCMS-I. The largest absolute errors occur in estimating state 1 and do
not attenuate until 50% of an MTBF has passed. The approximation estimates the state 1
probability to be zero because the class 1 embedded Markov chain is non-ergodic and yields
zero for the stationary state 1 probability. The estimated state probabilities in states 2 and
3 are very accurate with relative errors of less than 0.01% for all time steps.

The relative error in state 1 is 100% at all times because the normalized probabilities
in class 1 cannot converge to the stationary probabilities of the unperturbed semi-Markov
chain. This is because the state 1 probability will never be exactly zero. For example, at
the tenth time step the normalized state probabilities in class 1 are

$$\pi^{(1)}(10) = [0.981175,\ 0.018825], \qquad (75)$$

and the unperturbed stationary probabilities are:

$$\pi^{(1)} = [0,\ 1]. \qquad (76)$$

The approximate method requires that the normalized probabilities converge to the
stationary probabilities for each class in order to obtain accurate state probability estimates.
Figure 5: State probability time histories for SCMS-II. (a) Analytical and approximate
solutions, (b) Relative error.

Figure 6: Sensitivity to $\epsilon$ for SCMS-II. The relative error is plotted versus the single-step
probability, $\epsilon$, for mission times of 1 MTBF, 0.5 x MTBF, and 0.25 x MTBF.

The other source of error is due to non-zero $\epsilon$. In Figure 5, the value of $\epsilon$ was small
enough to provide accurate results because the state 2 and 3 probabilities were estimated
adequately. Figure 6 presents the class 2 (or state 3) occupancy probability for mission
times of 100%, 50% and 25% of an MTBF for a range of values of $\epsilon$ corresponding to a
component MTBF ranging from 4 seconds to 5555 hours. As was the case for the SCMS-I,
the relative error decreases markedly with decreasing $\epsilon$ for the three choices of mission time.
This reiterates the observation that the fast and slow time scales must be distinct in terms
of their mean holding times in order to obtain accurate estimates of the state probabilities.
This analysis also demonstrates the usefulness of the rule of thumb suggested earlier.

The Taylor series expansions for the analytical and approximate class 2 probability will
again be compared. Expanding the class 2 occupancy probability in a Taylor series about
$\epsilon = 0$ yields

$$\pi_2^e(n) = n\epsilon - \tfrac{1}{2}(n^2 - n)\epsilon^2 + o(\epsilon^2) \qquad (77)$$

$$\hat{\pi}_2^e(n) = n\epsilon - \tfrac{1}{2}n^2\epsilon^2 + o(\epsilon^2) \qquad (78)$$

To first order, $\pi_2^e(n)$ and $\hat{\pi}_2^e(n)$ are identical. This proves that the approximate method
produces a first order perturbation solution in $\epsilon$ for this model. The two expressions begin
to differ starting with the $\epsilon^2$ terms, but the dominant second order term ($n^2\epsilon^2$) is the same.
Hence, the error can be expressed as:

$$\pi_2^e(n) - \hat{\pi}_2^e(n) = \tfrac{1}{2}n\epsilon^2 + o(\epsilon^2) \qquad (79)$$

which is second order in $\epsilon$ and proportional to time, which emphasizes the asymptotic
nature of the approximation. Again, this observation is model dependent. However, the
same behavior was found for the SCMS-I.
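
The size of this error term can be checked numerically. The sketch below assumes the analytical class 2 probability is $1 - (1-\epsilon)^n$ and the approximate one is $1 - \exp(-\epsilon n)$, and compares their difference with the leading term $\tfrac{1}{2}n\epsilon^2$; the function names are hypothetical.

```python
import math


def class2_exact(n, eps):
    """Analytical class 2 occupancy probability after n time steps."""
    return 1.0 - (1.0 - eps)**n


def class2_approx(n, eps):
    """Approximate class 2 occupancy probability from the enlarged Markov process."""
    return 1.0 - math.exp(-eps * n)


eps = 0.00005
errors = {}
for n in (200, 2000, 20000):
    errors[n] = class2_exact(n, eps) - class2_approx(n, eps)
    print(n, errors[n], 0.5 * n * eps**2)
```

For small values of $n\epsilon$ the computed difference closely follows $\tfrac{1}{2}n\epsilon^2$; as $n\epsilon$ approaches one (a full MTBF) the higher order terms become noticeable and the leading term overestimates the difference.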

6  The  SCDR  System  Model 

The single-component dual-redundant (SCDR) system consists of two identical components,
a primary and a backup, operating in parallel. An independent sequential test monitors
the status of each component. The reliability of this system was evaluated using the
approximate technique in [13]. In the interest of brevity, the details are omitted here and the
interested reader is referred to [13].

7  Conclusions 

A primary contribution of this work is the extension of Korolyuk's limit theorem for
semi-Markov processes to semi-Markov chains in Theorem 1, which describes the conditions
under which a perturbed semi-Markov chain can be approximated by an enlarged Markov
process. Moreover, Theorem 1 describes how the parameters of the enlarged Markov process
are derived from the parameters of the semi-Markov chain.

Two problems arise in applying Theorem 1 to fault tolerant control system (FTCS)
models. First, the non-perturbed embedded Markov chains in each class are usually
non-ergodic. Ergodicity was required in Theorem 1, but this requirement was relaxed to the
existence of the Cesàro limit probabilities in Lemma 2. These were found to exist in Lemma 3
if the embedded Markov chain was either ergodic, or non-ergodic with one and only one unit
eigenvalue.

Second, the transition PMFs are typically not functions of the perturbation parameter $\epsilon$.
This problem was mitigated by introducing the concept of time scaling in Theorem 4. The
form of the transition PMFs was generalized to include those common to FTCS reliability
models. This generalization included a dependence on a time scaling factor $\delta$ and on a
small parameter $\epsilon$ that determined the state space partitioning of the original semi-Markov
chain.

Use of the approximate technique was demonstrated by two simple examples. Accurate
estimates of the state probabilities were determined for situations where $\epsilon$ was "small
enough" and where the normalized probabilities in each class had converged to the stationary
probabilities of the non-perturbed semi-Markov chain. In the two examples presented,
the approximate technique yielded a first order perturbation solution in $\epsilon$ to the analytically
obtained class probabilities.

The approximation error was found to be insignificant if the slow and fast time scales
were distinct. Finally, a rule of thumb was suggested by the error analysis: the slow and
fast time scales are distinct if the MTBF of the fastest failure is 1000 times longer than the
mean decision time of the slowest FDI event.